A Long Way to Go: Investigating Length Correlations in RLHF

Read original: arXiv:2310.03716 - Published 7/12/2024 by Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

Overview

This paper examines the impact of Reinforcement Learning from Human Feedback (RLHF) on the output length of large language models.
RLHF has been successful in aligning models to be more helpful in tasks like dialogue and web question answering, but it often leads to longer model outputs.
The paper demonstrates that optimizing for response length is a significant factor behind the improvements seen with RLHF, much more than previously thought, with reward gains driven largely by longer outputs instead of other features.
The authors identify the reward models used in RLHF as the dominant source of these length biases. They find that these reward models are non-robust and easily influenced by length biases in the preference data.

Plain English Explanation

The paper looks at a technique called Reinforcement Learning from Human Feedback (RLHF), which has been used to make large language models more helpful and aligned with human preferences. For example, RLHF has been used to improve models' performance on tasks like engaging in helpful dialogues or answering questions from the web.

However, the researchers found that when using RLHF, the language models tend to produce longer responses. The paper shows that this increase in response length is a major factor behind the improvements seen with RLHF, rather than the models actually learning to be more helpful in other ways.

By studying how the RLHF optimization process works, the authors discovered that the reward models used to guide the training are the main source of these length biases. These reward models, which are trained on human preferences, turn out to be easily influenced by the length of the responses, leading the models to simply produce longer outputs to get higher rewards.

The researchers tested various ways to counter this length bias, but found the problem to be quite difficult to solve. Their analysis suggests the reward models themselves are not very robust and struggle to avoid being swayed by the length of the responses, even when other desirable features are present.

Technical Explanation

The paper investigates the impact of Reinforcement Learning from Human Feedback (RLHF) on the output length of large language models. RLHF has been widely used to align models to be more helpful in tasks like dialogue and web question answering, but the authors find that it often leads to longer model outputs.

Through experiments across three diverse settings, the paper demonstrates that optimizing for response length is a significant factor behind the improvements seen with RLHF, rather than other desirable features. The authors study the strategies the RL optimization process uses to maximize reward, and find that the improvements in reward are largely driven by increasing response length.
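One way to check whether reward tracks length in a given setup is to correlate the two directly. The following is a minimal sketch, not the authors' analysis code: the function names are hypothetical, and whitespace-separated word count stands in for token count as a simplifying assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def length_reward_correlation(responses, rewards):
    """Correlate response length with reward-model score.

    Assumption: length is measured in whitespace-separated words,
    a rough proxy for the tokenizer-based length a real setup would use.
    """
    lengths = [len(text.split()) for text in responses]
    return pearson(lengths, rewards)


# Hypothetical scores for three responses of increasing length.
responses = ["short", "a bit longer", "a considerably longer response here"]
rewards = [0.1, 0.4, 0.9]
r = length_reward_correlation(responses, rewards)
```

A value of `r` close to 1 would suggest that reward rises with length in the sample, although a correlation alone does not show that length is the cause.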

In fact, the researchers show that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. They also test a comprehensive set of length-countering interventions, and through this testing identify the reward models used in RLHF as the dominant source of these biases.
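To make the idea of a purely length-based reward concrete, here is a hypothetical sketch of one possible form. This summary does not give the paper's exact formulation, so the capped linear shape, the `max_words` parameter, and the word-count length measure are all assumptions made for illustration.

```python
def length_only_reward(response, max_words=100):
    """Hypothetical reward that depends only on response length.

    Assumption: the reward grows linearly with word count and is capped
    at 1.0 once the response reaches `max_words`. Content is ignored.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    n_words = len(response.split())
    return min(n_words, max_words) / max_words


# Two responses with identical content quality but different length
# receive different rewards, because only length is scored.
brief = " ".join(["word"] * 50)
verbose = " ".join(["word"] * 150)
brief_reward = length_only_reward(brief)      # 0.5
verbose_reward = length_only_reward(verbose)  # 1.0
```

A policy optimized against a reward like this has no incentive to improve anything except length, which is what makes it a useful baseline for comparison with a learned reward model.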

By analyzing the training dynamics, the paper finds that these reward models are non-robust and easily influenced by length biases in the preference data used to train them. This suggests the length bias is a critical issue that needs to be addressed to ensure RLHF produces models aligned with human preferences, rather than just longer outputs.
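A simple diagnostic for length bias in preference data is the share of pairs in which the preferred response is the longer one. The sketch below is illustrative only: the function name and the `(chosen, rejected)` pair layout are assumptions, and length is again approximated by word count.

```python
def longer_preferred_rate(pairs):
    """Fraction of preference pairs where the chosen response is longer.

    `pairs` is an iterable of (chosen, rejected) strings. Pairs of equal
    length are skipped, since they carry no length signal.
    """
    longer = 0
    shorter = 0
    for chosen, rejected in pairs:
        chosen_len = len(chosen.split())
        rejected_len = len(rejected.split())
        if chosen_len > rejected_len:
            longer += 1
        elif chosen_len < rejected_len:
            shorter += 1
    decided = longer + shorter
    if decided == 0:
        raise ValueError("no pairs with differing lengths")
    return longer / decided


# Hypothetical toy dataset: two pairs favor the longer response,
# one favors the shorter response, and one is a length tie.
pairs = [
    ("a b c", "a"),
    ("a b", "a b c d"),
    ("a b c d", "a b"),
    ("x y", "p q"),
]
rate = longer_preferred_rate(pairs)
```

A rate well above 0.5 would indicate that a reward model trained on the data could score well on many pairs just by preferring the longer response.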

Critical Analysis

The paper provides a thorough and well-designed analysis of the impact of RLHF on model output length. The researchers' comprehensive experimental approach, which tests interventions across diverse settings, lends credibility to their findings.

However, the paper does not delve deeply into the potential reasons why the reward models used in RLHF are so susceptible to length biases. While the authors identify this as the dominant source of the problem, more investigation into the underlying causes could lead to more effective solutions.

Additionally, the paper does not address potential ways to mitigate the length bias issues beyond the interventions tested. Further research into alternative reward modeling approaches or architectural changes to the RLHF framework could yield more robust solutions.

The authors also acknowledge the limitations of their study, noting that they focus on output length as the primary factor, while other aspects of model behavior may also be influenced by RLHF in complex ways. Exploring these additional factors could provide a more holistic understanding of the tradeoffs and challenges involved in aligning large language models through RLHF.

Conclusion

This paper makes an important contribution to the understanding of Reinforcement Learning from Human Feedback (RLHF) and its impact on the behavior of large language models. The key finding is that the improvements seen with RLHF are often driven more by optimization for longer responses than by other desirable features like increased helpfulness or alignment with human preferences.

The authors identify the reward models used in RLHF as the primary source of these length biases, and find that these reward models are non-robust and easily influenced by length biases in the preference data. This suggests that addressing the length bias issue is critical for ensuring RLHF produces models that are truly aligned with human values, rather than just models that generate longer outputs.

The paper's detailed analysis and comprehensive experimental approach provide valuable insights for researchers and practitioners working on the challenging problem of aligning large language models through reinforcement learning from human feedback. The findings highlight the need for more robust reward modeling techniques and a deeper understanding of the complex tradeoffs involved in this emerging field of AI alignment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Long Way to Go: Investigating Length Correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for helpfulness in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be reward models, which, by studying training dynamics, we find are non-robust and easily influenced by length biases in preference data.

7/12/2024

Disentangling Length from Quality in Direct Preference Optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. However, RLHF is known to exploit biases in human preferences, such as verbosity. A well-formatted and eloquent answer is often more highly rated by users, even when it is less helpful and objective. A number of approaches have been developed to control those biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a separate reward model or use reinforcement learning directly, so previous approaches developed to control verbosity cannot be directly applied to this setting. Our work makes several contributions. For the first time, we study the length problem in the DPO setting, showing significant exploitation in DPO and linking it to out-of-distribution bootstrapping. We then develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality. We demonstrate these effects across datasets on summarization and dialogue, where we achieve up to 20% improvement in win rates when controlling for length, despite the GPT4 judge's well-known verbosity bias.

9/10/2024

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, Bruno Castro da Silva

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.

4/17/2024

Rethinking LLM-based Preference Evaluation

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Jingang Wang, Zhenyu Chen, Hui Xiong

The use of large language model (LLM)-based preference evaluations has become widespread for comparing model responses, but it has revealed a notable bias towards longer responses, questioning the reliability of such evaluations. This paper explores the length bias in LLM evaluations from a data-centric perspective, analyzing 14 commonly used preference datasets and 10 reward models. Our findings indicate that human preference labeling favors longer responses and this spurious correlation is learned by the reward model and subsequently propagated to the aligned model during training. We decompose the preference evaluation metric, i.e., win rate, from the perspective of human to identify the deeper factors and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. Controlled experiments demonstrate that response length impacts evaluations by influencing information mass. To ensure reliable evaluation metrics that assess content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model's answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation. Furthermore, we investigate length bias in DPO using AlpacaEval and AdapAlpaca. By testing Tulu2 and Tulu2-dpo at 7B, 13B, and 70B scales, we found that DPO leads to higher human preference, but this gain is amplified by response length, with AlpacaEval showing higher win rates gain than AdapAlpaca.

8/12/2024