Orthogonal Finetuning for Direct Preference Optimization

Read original: arXiv:2409.14836 - Published 9/25/2024 by Chenxu Yang, Ruipeng Jia, Naibin Gu, Zheng Lin, Siyuan Chen, Chao Pang, Weichong Yin, Yu Sun, Hua Wu, Weiping Wang


Overview

This paper proposes orthogonal finetuning for Direct Preference Optimization (DPO), implemented as a method called weight-Rotated Preference Optimization (RoPO).
The approach targets a weakness of DPO-tuned models: they tend to overfit during alignment, which shows up as overly long generations that lack diversity.
According to the abstract, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while improving generation diversity by an average of 6 points.

Plain English Explanation

The paper introduces a method called "Orthogonal Finetuning" for aligning language models with human preferences through Direct Preference Optimization (DPO). This is an important problem in AI alignment, as we want language models to behave in ways that are consistent with human values and preferences.

According to the paper's abstract, DPO-tuned models tend to overfit, which shows up as overly long generations that lack diversity. Earlier fixes add regularization by modifying the objective function, but they pay for it with weaker alignment performance. Orthogonal Finetuning instead regularizes the way the weights are updated.

The key idea is to allow only rotational and magnitude-stretching updates to the weight parameters. A pilot experiment in the paper found that overfitting correlates positively with fluctuations in hyperspherical energy, so keeping that energy invariant preserves the knowledge encoded in the angles between neurons while the model learns the preferences.

Through experiments on MT-Bench and AlpacaEval 2, the authors report that their method, weight-Rotated Preference Optimization (RoPO), outperforms DPO while producing more diverse generations. This represents a step forward in the quest to develop AI systems that reliably behave in alignment with human values.

Technical Explanation

The paper introduces orthogonal finetuning for DPO through a method called weight-Rotated Preference Optimization (RoPO). The approach builds on direct preference optimization, which fine-tunes a language model directly on preference pairs without training a separate reward model.
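For context, the standard DPO objective that this line of work starts from can be sketched for a single preference pair. This is a minimal illustration of the widely used DPO formulation, not code from the paper: the function names, the `beta` default, and the log-probability values are all assumptions chosen for the example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: scaled log-ratio of the policy to the frozen reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Hypothetical log-probabilities, for illustration only.
loss_at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
loss_after = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)    # chosen up, rejected down
print(loss_at_init, loss_after)
```

The loss only depends on how far the policy has moved from the reference on each response, so it keeps falling as the rejected response's log-probability is pushed down. That is one intuition for why an extra constraint on the weight updates might help against overfitting.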

The key observation comes from a pilot experiment: overfitting correlates positively with fluctuations in hyperspherical energy. RoPO therefore restricts finetuning to rotational and magnitude-stretching updates of the weight parameters, which keeps the hyperspherical energy invariant and preserves the knowledge encoded in the angles between neurons.

This weight-update view of regularization differs from earlier approaches that modify the objective function, which the authors say reduce overfitting at the cost of alignment performance. The abstract also reports that the method uses only 0.0086% of the trainable parameters.
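The invariance described in the paper's abstract, where rotational and magnitude-stretching updates leave the hyperspherical energy unchanged, can be illustrated with a toy example. The sketch below is not the RoPO implementation: it assumes a common form of hyperspherical energy (the sum of inverse pairwise distances between unit-normalized neurons), and the helper names and numeric values are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hyperspherical_energy(neurons, s=1.0):
    """Sum of inverse pairwise distances between unit-normalized neurons."""
    dirs = [unit(w) for w in neurons]
    energy = 0.0
    for i, a in enumerate(dirs):
        for j, b in enumerate(dirs):
            if i != j:
                energy += math.dist(a, b) ** (-s)
    return energy


def rotate_in_plane(v, i, j, theta):
    """Apply a Givens rotation by theta in the (i, j) coordinate plane."""
    out = list(v)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * v[i] - s * v[j]
    out[j] = s * v[i] + c * v[j]
    return out


# Three toy "neurons" (weight vectors); values are illustrative only.
neurons = [[1.0, 0.0, 0.5], [0.2, 1.0, -0.3], [-0.7, 0.4, 1.0]]

# Same rotation for every neuron, then a different positive stretch per neuron.
scales = [2.0, 0.5, 3.0]
rotated = [[k * x for x in rotate_in_plane(w, 0, 2, 0.8)]
           for w, k in zip(neurons, scales)]

# Unconstrained additive update to a single neuron, for contrast.
delta = [-0.4, 0.5, -0.4]
perturbed = [[x + d for x, d in zip(neurons[0], delta)]] + neurons[1:]

e_base = hyperspherical_energy(neurons)
e_rotated = hyperspherical_energy(rotated)
e_perturbed = hyperspherical_energy(perturbed)
print(e_base, e_rotated, e_perturbed)
```

A shared rotation preserves every angle between neurons, and positive per-neuron scaling disappears after normalization, so the first two energies agree up to floating-point error. The additive update changes the angles and therefore the energy.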

The authors evaluate RoPO on MT-Bench and AlpacaEval 2. The abstract reports that RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while improving generation diversity by an average of 6 points.

Critical Analysis

The Orthogonal Finetuning approach represents a meaningful advance in direct preference optimization for language models. By regularizing the weight updates rather than the objective function, the method targets the overfitting seen in DPO-tuned models without giving up alignment performance.

However, the paper does not fully explore the limitations and potential drawbacks of Orthogonal Finetuning. For instance, the authors do not discuss how the method might scale to more complex preference objectives or larger language models. There are also open questions about the computational and memory costs of constraining the weight updates.

Additionally, the paper focuses on preference alignment in isolated task settings, but does not consider how the optimized preferences might interact with the model's behavior in more open-ended, real-world scenarios. Potential negative side effects or unintended consequences of the preference optimization are not thoroughly examined.

Further research is needed to better understand the long-term implications of Orthogonal Finetuning and to explore ways to make the approach more scalable and robust. Careful evaluation of the method's safety and reliability in diverse contexts will be crucial as the field of AI alignment continues to advance.

Conclusion

The Orthogonal Finetuning technique proposed in this paper represents a step forward in the quest to develop language models that reliably behave in alignment with human values and preferences. By restricting finetuning to rotational and magnitude-stretching weight updates, the method addresses the overfitting seen in previous direct preference optimization approaches, leading to improved alignment and generation diversity.

The authors' experimental results on MT-Bench and AlpacaEval 2 support the effectiveness of Orthogonal Finetuning. This suggests the technique could have broad applicability in building AI systems that exhibit preferences and behaviors consistent with human-specified objectives.

While the paper does not fully explore the limitations and potential drawbacks of the method, it lays the groundwork for further research and development in this critical area of AI alignment. As the field continues to advance, techniques like Orthogonal Finetuning will play a crucial role in ensuring that powerful language models are imbued with preferences that are well-aligned with human values and interests.



Related Papers

Orthogonal Finetuning for Direct Preference Optimization

Chenxu Yang, Ruipeng Jia, Naibin Gu, Zheng Lin, Siyuan Chen, Chao Pang, Weichong Yin, Yu Sun, Hua Wu, Weiping Wang

DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.

9/25/2024

New Desiderata for Direct Preference Optimization

Xiangkun Hu, Tong He, David Wipf

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and constraints are handled. Our insights then motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results serve to corroborate notable aspects of our analyses.

7/15/2024


Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. Utilizing Distributionally Robust Optimization (DRO), we enhance DPO's resilience to these types of noise. Our theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient $\beta$ playing a critical role in its noise resistance. Extending this framework, we introduce Distributionally Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter $\beta'$ in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. The code is available at https://github.com/junkangwu/Dr_DPO.

7/11/2024

Direct Preference Optimization with an Offset

Afra Amini, Tim Vieira, Ryan Cotterell

Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal. Sometimes, the preferred response is only slightly better than the dispreferred one. In other cases, the preference is much stronger. For instance, if a response contains harmful or toxic content, the annotator will have a strong preference for that response. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited.

6/7/2024