Filtered Direct Preference Optimization

2404.13846

Published 4/24/2024 by Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

Abstract

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on Direct Preference Optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.

Overview

This paper introduces filtered direct preference optimization (fDPO), an extension of Direct Preference Optimization (DPO) for aligning large language models with human preferences.
fDPO uses a trained reward model to monitor text quality in the preference dataset during DPO training and discards lower-quality samples. Related work such as Provably Robust DPO targets noisy preference labels instead.
Experimental results show that fDPO improves final model performance.

Plain English Explanation

The goal of this research is to develop better ways to align large language models, like those used in chatbots and writing assistants, with human preferences and values. This is an important challenge because these models can sometimes produce outputs that are harmful, biased, or misaligned with what humans want.

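The key idea behind filtered direct preference optimization (fDPO) is to fine-tune the language model on human preference data, with an additional "filtering" step that removes low-quality examples from the dataset as training proceeds. A trained reward model scores the texts, and examples that score worse than what the model being trained can already generate are discarded. This is intended to make the dataset more accurate and to address a weakness of plain Direct Preference Optimization, which the authors find to be especially sensitive to text quality. Related work such as Provably Robust DPO tackles a different data problem, namely noisy preference labels.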

The authors also confirm experimentally that text quality in the preference dataset affects DPO-trained models more than models trained with reward-model-based RLHF. This finding motivates the filtering step.

Technical Explanation

The paper introduces filtered direct preference optimization (fDPO) for aligning large language models with human preferences. fDPO extends Direct Preference Optimization (DPO), a reward-model-free RLHF method, and sits alongside related work such as Provably Robust DPO, which addresses noisy preference labels.
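
For readers unfamiliar with DPO, the per-pair loss it minimizes can be sketched as follows. This is the standard DPO objective, not code from the fDPO repository; the function name, argument names, and the default value of beta are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities.

    Each argument is log pi(y | x) summed over the response tokens, under
    either the policy being trained or the frozen reference model.
    """
    # How much more (in log space) the policy likes each response
    # than the reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss shrinks as the policy raises the likelihood of the chosen response relative to the rejected one.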

The key innovation in fDPO is a "filtering" step applied to the preference dataset during DPO training. A trained reward model scores the texts in the dataset, and samples whose quality is lower than that of texts generated by the model being optimized are discarded, so training continues on a more accurate dataset.
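
The abstract describes the filtering as discarding samples whose quality, as scored by a trained reward model, is lower than that of texts generated by the model being optimized. A minimal sketch of that idea is shown below. It is not the authors' implementation: the names `PreferencePair`, `filter_preference_pairs`, `generate`, and `reward` are hypothetical, and comparing against the chosen response specifically is an assumption, since the abstract does not say which text in a pair is compared.

```python
from typing import Callable, List, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str    # response labeled as preferred
    rejected: str  # response labeled as not preferred


def filter_preference_pairs(
    pairs: List[PreferencePair],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Keep only pairs whose chosen text scores at least as high as the
    text the current policy generates for the same prompt.

    `generate` stands in for sampling from the model being optimized and
    `reward` for a trained reward model; both are assumptions here.
    """
    kept = []
    for pair in pairs:
        generated = generate(pair.prompt)
        # Assumption: the chosen response is the one compared against
        # the policy's own output.
        if reward(pair.prompt, pair.chosen) >= reward(pair.prompt, generated):
            kept.append(pair)
    return kept


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return "a medium answer"


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))  # longer text counts as "better" here


pairs = [
    PreferencePair("q1", "a long and detailed answer", "bad"),
    PreferencePair("q2", "short", "bad"),
]
kept = filter_preference_pairs(pairs, toy_generate, toy_reward)
```

In a training loop, a filter like this would presumably be re-run as the policy improves, so that the dataset shrinks to the samples the model has not yet surpassed; how often the authors re-run it is not stated in the abstract.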

The authors also confirm that text quality influences models optimized with DPO significantly more than models optimized with reward-model-based RLHF, which motivates the filtering step.

Experiments on language model fine-tuning demonstrate that fDPO improves final model performance compared with standard DPO.

Critical Analysis

The paper presents a promising new technique for aligning large language models with human preferences. The authors acknowledge that there are still limitations and areas for further research, such as:

The need for more comprehensive evaluation, including on real-world deployed systems rather than just simulated setups.
The challenge of scaling fDPO to the largest language models, which may require further algorithmic innovations.
The potential for human feedback to be biased or inconsistent, which could limit the effectiveness of preference-based optimization approaches.

Additionally, one could question whether a trained reward model fully captures the complexities of real-world human feedback and preferences, since fDPO relies on it to decide which samples to discard. There may be further practical challenges that arise when deploying these techniques in the messy reality of interactive AI systems.

Overall, though, this appears to be a valuable contribution to the important problem of aligning powerful AI systems with human values. The authors have demonstrated a thoughtful approach that builds on prior work and shows promising empirical results. Continued research in this direction could lead to significant advances in making AI systems more reliable and trustworthy.

Conclusion

This paper introduces filtered direct preference optimization (fDPO) for aligning large language models with human preferences. fDPO extends Direct Preference Optimization (DPO) and is related to work such as Provably Robust DPO.

The key innovation in fDPO is a "filtering" step in which a trained reward model monitors text quality in the preference dataset and lower-quality samples are discarded during DPO training. The authors also confirm that DPO is more sensitive to text quality than reward-model-based RLHF.

Experiments demonstrate that fDPO improves final model performance compared with standard DPO. However, the authors acknowledge that there are still limitations and areas for further research, such as the need for more comprehensive evaluation and the challenge of scaling to the largest language models.

Overall, this paper represents an important step forward in the crucial task of aligning powerful AI systems with human values and preferences. Continued progress in this direction could lead to significant improvements in the reliability and trustworthiness of AI technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Token-level Direct Preference Optimization

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

4/19/2024

cs.CL cs.AI

From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. First, we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token-level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token-level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.

4/19/2024

cs.LG

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function, which de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.

4/15/2024

cs.LG cs.CL

D2PO: Discriminator-Guided DPO with Response Evaluation Models

Prasann Singhal, Nathan Lambert, Scott Niekum, Tanya Goyal, Greg Durrett

Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to DPO with the same data budget, and greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.

5/3/2024

cs.CL