D2PO: Discriminator-Guided DPO with Response Evaluation Models

2405.01511

Published 5/3/2024 by Prasann Singhal, Nathan Lambert, Scott Niekum, Tanya Goyal, Greg Durrett

Abstract

Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to DPO with the same data budget, and greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.

Overview

This paper introduces D2PO (discriminator-guided DPO), a method for aligning language models with human preferences in the online setting, where preferences are collected throughout training.
D2PO is related to other recent DPO variants such as Provably Robust DPO and Filtered Direct Preference Optimization, which also deal with the quality of preference data.
The key idea is to train a discriminator (a response evaluation model) on the gold preferences collected so far and use it to silver-label additional synthetic data for policy training.
Experiments across a diverse set of tasks, including a realistic chat setting, show that D2PO produces higher-quality outputs than DPO with the same data budget and needs less preference data.

Plain English Explanation

The goal of this research is to develop better ways to train language models so that they behave in alignment with human values and preferences. Previous methods like Provably Robust DPO and Filtered Direct Preference Optimization have made progress, but they still have some limitations.

The key innovation in this paper is the use of a "discriminator" model: a separate model, like a reward model, that is trained to evaluate the quality of responses. As human ("gold") preferences are collected during training, they are used to train both the language model and the discriminator. The discriminator then labels additional pairs of model outputs, and these synthetic ("silver") labels give the language model more preference data to learn from.

According to the abstract, this discriminator-guided approach produces higher-quality outputs than DPO with the same data budget and needs less preference data to get there. Silver labeling helps most when the policy is trained with DPO rather than PPO, and when the discriminator is kept separate from the policy model.

Overall, this work represents an important step forward in the challenge of aligning language models with human values. By incorporating feedback from a discriminator model, the D2PO method helps push language models to behave in ways that are more in tune with human preferences.

Technical Explanation

The D2PO method builds on previous work in Direct Preference Optimization (DPO), which aims to directly optimize language models to generate responses that align with human preferences. However, DPO has some limitations, such as the difficulty of collecting high-quality preference data.
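For reference, the standard DPO objective for a single preference pair can be sketched as below. This is a minimal illustration, not the authors' code: the function and argument names are made up for this example, the inputs are assumed to be summed log-probabilities of each full response, and the default `beta=0.1` is only an illustrative value.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the log-probability of a whole response under the
    policy being trained or under the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # The loss shrinks as the policy favors the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen response.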

To address these issues, D2PO incorporates a discriminator: a response evaluation model trained on the gold preferences as they are collected. The discriminator then silver-labels additional responses sampled from the policy, giving the policy more preference pairs to train on than the gold labeling budget alone would provide.
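A minimal sketch of how a discriminator could be trained and then used to produce a synthetic ("silver") preference pair, in the sense used by the abstract. It assumes the discriminator is a scalar scoring function trained with a Bradley-Terry style pairwise loss, which is a common choice for reward models; this summary does not confirm that the authors use exactly this form. All names here, including `toy_score`, are hypothetical.

```python
import math
from typing import Callable, Tuple

# Hypothetical interface: a discriminator maps (prompt, response) to a score.
ScoreFn = Callable[[str, str], float]


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(score_chosen - score_rejected)).

    Assumed training objective for the discriminator on gold pairs.
    """
    diff = score_chosen - score_rejected
    # Stable evaluation of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def silver_label(
    prompt: str, response_a: str, response_b: str, score_fn: ScoreFn
) -> Tuple[str, str]:
    """Return (chosen, rejected) according to the discriminator's scores."""
    if score_fn(prompt, response_a) >= score_fn(prompt, response_b):
        return response_a, response_b
    return response_b, response_a


def toy_score(prompt: str, response: str) -> float:
    """Stand-in discriminator for illustration only: prefers longer responses."""
    return float(len(response.split()))
```

With the toy scorer, `silver_label("q", "a b c", "a", toy_score)` returns `("a b c", "a")`. In practice the scorer would be a learned model, and any bias it has would carry over into the silver labels.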

Specifically, based on the abstract, each round of D2PO training consists of the following steps:

Sample responses from the current policy and collect gold preference labels for some of them.
Use the gold preferences to train both the policy and a separate discriminator.
Use the discriminator to silver-label additional policy samples, and train the policy on these pairs with a DPO objective.
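One round of the online procedure described in the abstract could be organized roughly as follows. This is a structural sketch under stated assumptions, not the authors' implementation: the callables (`sample_fn`, `gold_label_fn`, `silver_label_fn`, `update_discriminator`, `update_policy`) and the `gold_budget` parameter are hypothetical placeholders, and the toy stand-ins at the bottom exist only so the example runs.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)
LabelFn = Callable[[str, str, str], Tuple[str, str]]


def d2po_round(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    gold_label_fn: LabelFn,
    silver_label_fn: LabelFn,
    update_discriminator: Callable[[List[Pair]], None],
    update_policy: Callable[[List[Pair]], None],
    gold_budget: int,
    rng: random.Random,
) -> Tuple[List[Pair], List[Pair]]:
    """Run one illustrative round of discriminator-guided preference training."""
    # Sample two candidate responses per prompt from the current policy.
    candidates = [(p, sample_fn(p), sample_fn(p)) for p in prompts]
    rng.shuffle(candidates)

    # Spend the limited gold budget on a subset of the candidates.
    gold_part = candidates[:gold_budget]
    silver_part = candidates[gold_budget:]
    gold_pairs = [(p, *gold_label_fn(p, a, b)) for p, a, b in gold_part]

    # Gold preferences train the discriminator...
    update_discriminator(gold_pairs)

    # ...which then silver-labels the remaining candidates.
    silver_pairs = [(p, *silver_label_fn(p, a, b)) for p, a, b in silver_part]

    # The policy is updated (e.g. with a DPO loss) on gold plus silver pairs.
    update_policy(gold_pairs + silver_pairs)
    return gold_pairs, silver_pairs


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def prefer_longer(prompt: str, a: str, b: str) -> Tuple[str, str]:
    return (a, b) if len(a) >= len(b) else (b, a)


rng = random.Random(0)
seen_by_policy: List[Pair] = []
gold, silver = d2po_round(
    prompts=[f"prompt {i}" for i in range(6)],
    sample_fn=lambda p: p + " " + "word " * rng.randint(1, 5),
    gold_label_fn=prefer_longer,
    silver_label_fn=prefer_longer,
    update_discriminator=lambda pairs: None,
    update_policy=seen_by_policy.extend,
    gold_budget=2,
    rng=rng,
)
```

The point of the structure is that only `gold_budget` pairs per round require gold labels, while the policy still trains on every sampled pair. How the real method balances gold and silver data is not specified in this summary.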

The researchers show that this discriminator-guided approach leads to higher-quality outputs than DPO with the same data budget across a diverse set of tasks, including a realistic chat setting.

Critical Analysis

The D2PO method represents an interesting step forward in the challenge of aligning language models with human values. By incorporating a discriminator model to provide ongoing feedback, the approach seems to address some of the limitations of earlier DPO methods.

However, the paper does acknowledge some important caveats and areas for further research. For example, the discriminator model itself may be biased or have blind spots, which could then be reflected in the language model's outputs. There are also open questions about how to best train and scale the discriminator model.

Additionally, although the experiments span a diverse set of tasks, including a realistic chat setting, it remains unclear how well the D2PO approach would generalize to other domains. More research would be needed to fully understand the capabilities and limitations of this method.

Overall, the D2PO method is a promising direction, but there is still much work to be done to achieve robust and reliable alignment of language models with human preferences. Continued research and experimentation in this area will be crucial as language models become more powerful and influential.

Conclusion

The D2PO method represents an important step forward in the challenge of aligning language models with human values and preferences. By incorporating a discriminator model to provide ongoing feedback during training, the approach seems to produce language models that generate more desirable responses compared to previous DPO methods.

While the paper identifies some caveats and areas for further research, the overall results are encouraging and suggest that discriminator-guided approaches could be a fruitful direction for future work in this domain. As language models become more powerful and influential, developing techniques to ensure their behavior aligns with human values will only become more crucial. The D2PO method is a valuable contribution to this critical challenge.

Related Papers

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function, which de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.

4/15/2024

cs.LG cs.CL

Filtered Direct Preference Optimization

Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on Direct Preference Optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.

4/24/2024

cs.LG cs.AI cs.CL

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.

4/23/2024

cs.CL

From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. First, we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token-level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token-level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.

4/19/2024

cs.LG