Rethinking LLM-based Preference Evaluation

Read original: arXiv:2407.01085 - Published 8/12/2024 by Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Jingang Wang, Zhenyu Chen, Hui Xiong

Overview

Evaluates the reliability and biases of large language models (LLMs) for preference evaluation tasks
Identifies key factors that influence the win rate of LLM-based preference evaluation
Proposes methods to improve the accuracy and robustness of LLM-based preference evaluation

Plain English Explanation

This paper examines the use of large language models (LLMs) for preference evaluation, which is the task of judging which of two responses is preferred. The researchers investigate the major factors that influence the "win rate": the fraction of comparisons in which a model's response is preferred over a reference response.

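One key finding is that response length has a significant impact on the win rate. LLM-based evaluation tends to favor longer responses, which calls the reliability of such evaluations into question. The paper traces this bias to human preference labels that favor longer responses, a spurious correlation that reward models learn and pass on to aligned models. The researchers propose AdapAlpaca, a length-controlled evaluation approach, to address this issue.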

The paper also explores other potential sources of bias, such as the format in which the responses are presented (e.g., natural language versus a structured comparison). The researchers report that LLM judges can be systematically biased towards certain formats, which can lead to inaccurate preference evaluations.

Overall, this research highlights the need to carefully consider the limitations and biases of LLMs when using them for preference evaluation tasks. The findings and proposed solutions can help improve the reliability and robustness of LLM-based preference evaluation, with potential applications in areas like product recommendation, policy decision-making, and user experience design.

Technical Explanation

The paper Rethinking LLM-based Preference Evaluation investigates the factors that influence win rate in LLM-based preference evaluation. In this setting, an LLM judges which of two responses is preferred, a practice the paper's abstract describes as widespread for comparing model responses.
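To make the win-rate metric concrete, here is a minimal sketch. It assumes the usual benchmark convention that win rate is the fraction of pairwise comparisons in which the judge prefers the test response over a reference response, with ties counted as half a win. The function names and the toy judge are hypothetical; the judge simply prefers the response with more words, as a stand-in for a length-biased LLM evaluator, and is not the paper's implementation.

```python
from typing import Callable, List, Tuple


def win_rate(pairs: List[Tuple[str, str]],
             judge: Callable[[str, str], float]) -> float:
    """Fraction of comparisons won by the test response (ties count as 0.5)."""
    if not pairs:
        raise ValueError("need at least one (test, reference) pair")
    return sum(judge(test, ref) for test, ref in pairs) / len(pairs)


def length_biased_judge(test: str, reference: str) -> float:
    """Hypothetical stand-in for an LLM judge: prefers the longer response."""
    t, r = len(test.split()), len(reference.split())
    if t == r:
        return 0.5
    return 1.0 if t > r else 0.0


# Toy (test, reference) pairs, made up for illustration.
pairs = [
    ("a long and detailed answer", "short answer"),
    ("yes", "no, because of reasons"),
    ("same size", "equal length"),
    ("one two three", "one"),
]

print(win_rate(pairs, length_biased_judge))
```

Because this toy judge looks only at word count, the resulting win rate says nothing about content quality, which is the failure mode the paper is concerned with.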

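The researchers conducted a series of experiments to understand the major factors that affect the win rate of LLM-based preference evaluation. They found that response length is a significant factor, with evaluations tending to favor longer responses. The paper's abstract decomposes win rate into two axes of a model response: desirability, which is length-independent and related to trustworthiness, and information mass, which is length-dependent and can be represented by conditional entropy. Controlled experiments indicate that length affects evaluations by influencing information mass. To address this, the paper proposes AdapAlpaca, a length-controlled evaluation approach that adjusts the lengths of reference answers to match the test model's answers within the same interval, so that win rate reflects content quality rather than response length.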
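The paper's abstract describes the length-controlled approach as matching reference answers to the test model's answers within the same length interval. A rough sketch of that idea is shown below. The interval width of 200 words, the function names, and the closest-length fallback are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def length_bucket(text: str, width: int = 200) -> int:
    """Index of the word-count interval containing `text` (width is assumed)."""
    return len(text.split()) // width


def pick_reference(test_answer: str, references: List[str],
                   width: int = 200) -> str:
    """Pick a reference in the same length interval as the test answer.

    If no reference falls in that interval, fall back to the reference
    whose word count is closest (an assumption of this sketch).
    """
    if not references:
        raise ValueError("need at least one reference answer")
    target = length_bucket(test_answer, width)
    same_interval = [r for r in references if length_bucket(r, width) == target]
    if same_interval:
        return same_interval[0]
    n_words = len(test_answer.split())
    return min(references, key=lambda r: abs(len(r.split()) - n_words))


# Toy references of 50 and 250 words for the same prompt.
references = ["short " * 50, "long " * 250]
test_answer = "word " * 230  # falls in the 200-399 word interval

chosen = pick_reference(test_answer, references)
print(len(chosen.split()))
```

The judge then compares the test answer against a reference of similar length, so a longer test answer does not gain an advantage from length alone.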

Additionally, the researchers explored other potential sources of bias, such as the format of the responses being compared (e.g., natural language vs. structured comparison). They report that LLM judges can be systematically biased towards certain formats, leading to inaccurate preference evaluations.

The paper's findings highlight the need to carefully consider the limitations and biases of LLMs when using them for preference evaluation tasks. The proposed solutions, such as the length-controlled evaluation approach, can help improve the reliability and robustness of LLM-based preference evaluation, with potential applications in a variety of domains.

Critical Analysis

The paper provides a thorough and well-designed investigation into the factors that influence the performance of LLMs in preference evaluation tasks. The researchers have identified several key issues, such as the bias towards longer responses and the format-specific biases, which are important considerations for practitioners using LLMs in real-world applications.

However, the paper does not fully address the potential limitations of the proposed solutions. For example, the length-controlled evaluation approach may not be suitable for all types of preference evaluation tasks, as the optimal input length may vary depending on the specific context and the complexity of the options being compared.

Additionally, the paper does not discuss the generalizability of the findings across different LLM architectures and training datasets. It would be beneficial to explore whether the observed biases and the proposed solutions are applicable to a broader range of LLMs, beyond the specific models used in the study.

Further research could also investigate the potential impact of other factors, such as the semantic complexity of the input text, the level of subjectivity in the preference evaluation, and the influence of cultural or demographic biases. Exploring these additional dimensions could provide a more comprehensive understanding of the challenges and potential solutions for LLM-based preference evaluation.

Conclusion

This paper presents a significant contribution to the understanding of the limitations and biases of large language models (LLMs) in preference evaluation tasks. The researchers have identified key factors, such as response length and format, that can significantly impact the win rate of LLM-based preference evaluation.

The proposed solutions, like the length-controlled evaluation approach, offer promising avenues to improve the reliability and robustness of LLM-based preference evaluation. These insights have important implications for the development and deployment of LLMs in a wide range of applications, from product recommendation to policy decision-making.

As the use of LLMs continues to grow, it is crucial to understand and address their biases and limitations. This paper's findings and the proposed methods represent an important step towards more accurate and trustworthy preference evaluation systems, with the potential to have a significant impact on both the research and practical applications of these powerful language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Rethinking LLM-based Preference Evaluation

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Jingang Wang, Zhenyu Chen, Hui Xiong

The use of large language model (LLM)-based preference evaluations has become widespread for comparing model responses, but it has revealed a notable bias towards longer responses, questioning the reliability of such evaluations. This paper explores the length bias in LLM evaluations from a data-centric perspective, analyzing 14 commonly used preference datasets and 10 reward models. Our findings indicate that human preference labeling favors longer responses and this spurious correlation is learned by the reward model and subsequently propagated to the aligned model during training. We decompose the preference evaluation metric, i.e., win rate, from the perspective of human to identify the deeper factors and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. Controlled experiments demonstrate that response length impacts evaluations by influencing information mass. To ensure reliable evaluation metrics that assess content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model's answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation. Furthermore, we investigate length bias in DPO using AlpacaEval and AdapAlpaca. By testing Tulu2 and Tulu2-dpo at 7B, 13B, and 70B scales, we found that DPO leads to higher human preference, but this gain is amplified by response length, with AlpacaEval showing higher win rates gain than AdapAlpaca.

8/12/2024

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto

LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: What would the preference be if the model's and baseline's output had the same length? To achieve this, we first fit a generalized linear model to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, we also find that it increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code and leaderboard at https://tatsu-lab.github.io/alpaca_eval/ .

4/9/2024
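The regression idea in the Length-Controlled AlpacaEval abstract above (fit a GLM on length difference, then predict with a zero length difference) can be illustrated with a heavily simplified sketch. It assumes a one-feature logistic regression fitted by plain gradient ascent, whereas the abstract mentions other relevant features that are omitted here. The toy data, function names, and hyperparameters are made up for illustration and are not the released implementation.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data: List[Tuple[float, int]],
                 lr: float = 0.5, steps: int = 5000) -> Tuple[float, float]:
    """Fit p(win) = sigmoid(a + b * length_diff) by gradient ascent
    on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_a = sum(y - sigmoid(a + b * x) for x, y in data) / n
        grad_b = sum((y - sigmoid(a + b * x)) * x for x, y in data) / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b


# Toy data (made up): x = +1 means the test answer is longer than the
# baseline, x = -1 means it is shorter; y = 1 means the judge preferred it.
data = [(1.0, 1)] * 6 + [(1.0, 0)] * 2 + [(-1.0, 1)] * 1 + [(-1.0, 0)] * 3

raw_win_rate = sum(y for _, y in data) / len(data)
a, b = fit_logistic(data)

# Counterfactual prediction: what if both answers had the same length?
length_controlled = sigmoid(a + b * 0.0)

print(round(raw_win_rate, 3), round(length_controlled, 3))
```

On this toy data the raw win rate is 7/12 because the test model is usually longer and longer answers usually win, while the estimate at zero length difference is about 0.5.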

Disentangling Length from Quality in Direct Preference Optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. However, RLHF is known to exploit biases in human preferences, such as verbosity. A well-formatted and eloquent answer is often more highly rated by users, even when it is less helpful and objective. A number of approaches have been developed to control those biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a separate reward model or use reinforcement learning directly, so previous approaches developed to control verbosity cannot be directly applied to this setting. Our work makes several contributions. For the first time, we study the length problem in the DPO setting, showing significant exploitation in DPO and linking it to out-of-distribution bootstrapping. We then develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality. We demonstrate these effects across datasets on summarization and dialogue, where we achieve up to 20% improvement in win rates when controlling for length, despite the GPT4 judge's well-known verbosity bias.

9/10/2024

A Long Way to Go: Investigating Length Correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for helpfulness in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be reward models, which, by studying training dynamics, we find are non-robust and easily influenced by length biases in preference data.

7/12/2024