Disentangling Length from Quality in Direct Preference Optimization

Read original: arXiv:2403.19159 - Published 9/10/2024 by Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

Overview

This paper studies how to disentangle response length from response quality in Direct Preference Optimization (DPO).
DPO fine-tunes a language model directly on human preference pairs, without a separate reward model, and in doing so it can exploit the verbosity bias present in those preferences.
The researchers propose a simple regularization strategy that prevents this length exploitation while preserving the improvements in model quality.

Plain English Explanation

The paper examines a common problem with Direct Preference Optimization (DPO), a method for training language models on human preferences. DPO shows the model pairs of responses, one preferred over the other, and teaches it to favor the preferred one. However, longer and more eloquent answers are often rated higher even when they are less helpful, and the researchers found that DPO-trained models exploit this by producing longer text rather than better text.

For example, a DPO-trained model might generate very long responses that win comparisons even though much of the content is repetitive or irrelevant. The paper proposes a way to "disentangle" length from quality, so that the model improves its answers without simply making them longer.

The fix is a regularization term, added to the training objective, that keeps the model from being rewarded for length alone. By addressing this length bias, the researchers aim to improve the overall quality and usefulness of text generated by DPO-trained models.

Technical Explanation

The paper first provides background on Direct Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a separate reward model or use reinforcement learning directly; it optimizes the policy on preference pairs, measured against a fixed reference model.
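To make the background concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the summed sequence log-probabilities under the policy and the reference model have already been computed; the function names and the default `beta` are illustrative choices, not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's log-probability of the chosen response lowers it. Note that the inputs are sums over tokens, so their scale grows with sequence length.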

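The key finding is that DPO exhibits significant length exploitation: trained models produce longer responses that are rated highly even when much of the content is repetitive or irrelevant. The researchers link this behavior to out-of-distribution bootstrapping. Because DPO has no explicit reward model, the verbosity controls developed for classical RLHF cannot be applied directly. This summary refers to the issue as the "length bias" problem.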
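One simple way to check whether a preference dataset carries a length signal in the first place is to count how often the preferred response is the longer one. The sketch below is a diagnostic under stated assumptions, not the paper's analysis: `length_stats` is a hypothetical helper, length is measured in whitespace-separated tokens as a crude proxy for a real tokenizer, and the pairs are made up.

```python
def length_stats(pairs):
    """Summarize length differences in (chosen, rejected) text pairs.

    Length is counted in whitespace-separated tokens, a crude proxy
    for the tokenizer a real pipeline would use.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    diffs = [len(c.split()) - len(r.split()) for c, r in pairs]
    chosen_longer = sum(1 for d in diffs if d > 0)
    return {
        "frac_chosen_longer": chosen_longer / len(diffs),
        "mean_length_diff": sum(diffs) / len(diffs),
    }


# Toy, made-up pairs for illustration only.
toy_pairs = [
    ("a detailed and long answer here", "short answer"),
    ("yes", "a rambling reply that goes on"),
    ("clear and complete reply", "ok"),
]
stats = length_stats(toy_pairs)
```

A fraction well above one half suggests that a model could raise its apparent preference score simply by writing more.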

To address this, they develop a principled but simple regularization strategy: a length penalty is built into the DPO objective so that the model is not rewarded for length alone, while improvements in quality are retained. Because DPO has no separate reward model, the penalty is applied within the preference objective itself rather than to an explicit reward.
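The length-penalty idea above can be sketched as a small change to the per-pair DPO loss. This is one plausible form, not a confirmed reproduction of the paper's objective: it assumes the preference labels reflect a reward r while the policy should optimize r minus `alpha` times response length, which places the term `alpha * (chosen_len - rejected_len)` inside the sigmoid. The name `length_regularized_dpo_loss`, the sign convention, and the default `alpha` are assumptions to check against the original paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def length_regularized_dpo_loss(policy_chosen_logp: float,
                                policy_rejected_logp: float,
                                ref_chosen_logp: float,
                                ref_rejected_logp: float,
                                chosen_len: int, rejected_len: int,
                                beta: float = 0.1,
                                alpha: float = 0.01) -> float:
    """DPO-style loss with a length term (assumed form, see lead-in)."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Assumed sign convention: a longer chosen response adds to the logit,
    # so the pair contributes a smaller loss and a weaker training signal.
    length_term = alpha * (chosen_len - rejected_len)
    return softplus(-(margin + length_term))
```

Under this convention, pairs where the preferred response is longer push the policy less, and pairs where it is shorter push the policy more. Setting `alpha` to zero recovers the standard DPO loss.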

The researchers evaluate their method on summarization and dialogue datasets and report up to a 20% improvement in win rates when controlling for length, despite the GPT-4 judge's well-known verbosity bias.
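To illustrate why length needs to be controlled for when reading win rates, the sketch below computes a win rate over all comparisons and over only those where the two responses have similar lengths. This length-matching filter is a simple stand-in chosen for illustration; the summary does not say how the paper controls for length, and the `win_rate` helper, its `max_len_ratio` parameter, and the data are all hypothetical.

```python
def win_rate(results, max_len_ratio=None):
    """Fraction of comparisons won, optionally on length-matched pairs only.

    results: list of dicts with keys 'win' (bool), 'len_a', 'len_b'.
    max_len_ratio: if set, drop comparisons whose longer/shorter length
    ratio exceeds this value.
    """
    kept = []
    for r in results:
        if max_len_ratio is not None:
            longer = max(r["len_a"], r["len_b"])
            shorter = max(min(r["len_a"], r["len_b"]), 1)  # avoid division by zero
            if longer / shorter > max_len_ratio:
                continue
        kept.append(r["win"])
    if not kept:
        return None  # no comparisons survived the filter
    return sum(kept) / len(kept)


# Made-up judge outcomes for illustration only.
toy_results = [
    {"win": True, "len_a": 100, "len_b": 105},
    {"win": True, "len_a": 300, "len_b": 100},
    {"win": False, "len_a": 90, "len_b": 100},
    {"win": True, "len_a": 50, "len_b": 52},
]
overall = win_rate(toy_results)
matched = win_rate(toy_results, max_len_ratio=1.2)
```

If the overall figure is much higher than the length-matched one, part of the apparent gain may come from verbosity rather than quality.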

Critical Analysis

The paper provides a thoughtful analysis of an important issue in Direct Preference Optimization: its tendency to favor longer text regardless of quality. The proposed solution is simple, and the reported experimental results are promising.

One potential limitation is that the techniques may not generalize well to more open-ended text generation tasks, where the appropriate length can vary greatly depending on the context. The researchers acknowledge this and suggest further research is needed.

Additionally, although the paper links the length exploitation to out-of-distribution bootstrapping, a fuller account of the root causes could lead to even more effective solutions. It would also be interesting to see how the method compares with the verbosity controls developed for classical RLHF.

Overall, this paper makes a valuable contribution by shedding light on an important challenge in text generation and proposing practical techniques to address it. The findings could have significant implications for building more reliable and useful language models.

Conclusion

This paper tackles an important problem in Direct Preference Optimization: its tendency to favor longer text regardless of quality. The researchers "disentangle" length from quality by adding a length regularization term to the training objective.

By addressing this length bias, the researchers aim to improve the overall quality and usefulness of text generated by DPO models. The findings could have broad implications for building more reliable and effective language models across a variety of applications.

While the techniques show promise, further research is needed to fully understand the root causes of the length bias and explore how the solutions generalize to more open-ended text generation tasks. Nonetheless, this paper makes a valuable contribution to the field of text generation and optimization.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Disentangling Length from Quality in Direct Preference Optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. However, RLHF is known to exploit biases in human preferences, such as verbosity. A well-formatted and eloquent answer is often more highly rated by users, even when it is less helpful and objective. A number of approaches have been developed to control those biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a separate reward model or use reinforcement learning directly, so previous approaches developed to control verbosity cannot be directly applied to this setting. Our work makes several contributions. For the first time, we study the length problem in the DPO setting, showing significant exploitation in DPO and linking it to out-of-distribution bootstrapping. We then develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality. We demonstrate these effects across datasets on summarization and dialogue, where we achieve up to 20% improvement in win rates when controlling for length, despite the GPT4 judge's well-known verbosity bias.

9/10/2024

Length Desensitization in Directed Preference Optimization

Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, Xunliang Cai

Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO's optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during the DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We utilized two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with human-real preferences.

9/11/2024

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.

6/18/2024

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun

Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: verbosity, a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between sequence-level Kullback-Leibler (KL) divergences between chosen and rejected sequences, used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate the presence of biased rewards. We then introduce an effective downsampling approach, named SamPO, to eliminate potential length reliance. Our experimental evaluations, conducted across three LLMs of varying scales and a diverse array of conditional and open-ended benchmarks, highlight the efficacy of SamPO in mitigating verbosity, achieving improvements of 5% to 12% over DPO through debiased rewards. Our code can be accessed at: https://github.com/LuJunru/SamPO/.

8/16/2024