Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

2404.10719

Published 4/23/2024 by Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

Abstract

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.

Overview

This paper compares two methods for aligning large language models (LLMs) with human preferences: Proximal Policy Optimization (PPO), a reward-based reinforcement learning algorithm, and Direct Preference Optimization (DPO), a reward-free alternative.
The researchers benchmark both methods across a collection of RLHF testbeds, ranging from dialogue to code generation.
The paper analyzes the algorithmic properties of DPO, examines the key factors behind strong PPO performance, and reports that PPO surpasses other alignment methods in all of the evaluated cases.

Plain English Explanation

The paper explores two different approaches, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), for training large language models (LLMs) to behave in a way that aligns with human preferences. The researchers compare the performance of these methods on a collection of tasks ranging from dialogue to code generation.

The key idea behind both PPO and DPO is to modify the training process of the LLM so that it produces outputs closer to what humans would prefer. PPO-based RLHF does this in two stages: it first learns a reward model from human preference data, then uses reinforcement learning to tune the LLM so that its outputs score highly under that reward model. In contrast, DPO skips the separate reward model and trains the LLM directly on pairs of preferred and rejected responses, pushing up the likelihood of the preferred response relative to the rejected one.

The paper provides a detailed evaluation of these approaches. Although reward-free methods such as DPO often achieve state-of-the-art results on academic benchmarks, the researchers argue, based on theoretical and empirical analysis, that DPO may have fundamental limitations. They also identify the key factors needed to get the best performance out of PPO, and report that a well-tuned PPO surpasses other alignment methods in all of their experiments, including state-of-the-art results in challenging code competitions.

Overall, this research offers valuable insights into the challenge of aligning LLMs with human values and preferences, a crucial issue as these models become increasingly powerful and influential. The findings can help guide the development of more effective AI alignment techniques, which will be essential for ensuring that advanced AI systems behave in ways that are beneficial to humanity.

Technical Explanation

The paper compares two RLHF methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), for the task of aligning large language models (LLMs) with human preferences.

PPO is a reward-based method: a reward model is first learned from human preference data, and an actor-critic algorithm then optimizes the policy against that reward, typically with a penalty that keeps the policy close to a reference model. In contrast, DPO is reward-free: it skips the explicit reward model and optimizes the policy directly on pairs of preferred and rejected responses.
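To make the contrast between the two objectives concrete, here is a minimal sketch of the standard DPO loss for a single preference pair and the standard PPO clipped surrogate for a single action. It is an illustration in plain Python, not the authors' implementation: the function names, the default `beta=0.1` and `clip_eps=0.2`, and the use of scalar log-probabilities are all assumptions made for readability.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trained policy and under a frozen reference model.
    beta is a hypothetical default, not a value taken from the paper.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one action (a quantity to maximize).

    The advantage would come from a learned reward model plus a critic;
    here it is simply passed in as a number.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Policy identical to the reference: the DPO loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Probability ratio of 2 with a positive advantage is clipped at 1 + clip_eps.
    print(ppo_clipped_objective(math.log(2.0), 0.0, 1.0))
```

Note the structural difference: the DPO loss needs only log-probabilities of responses already in the preference dataset, while the PPO objective needs an advantage estimate for responses sampled from the current policy.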

The researchers benchmark PPO and DPO across a collection of RLHF testbeds, ranging from dialogue to code generation. They report that PPO surpasses other alignment methods in all cases and achieves state-of-the-art results in challenging code competitions.

The paper also provides theoretical and empirical studies of the algorithmic properties of DPO and argues that DPO may have fundamental limitations. On the PPO side, the researchers ask why PPO performs poorly on academic benchmarks and examine the key factors for getting the best performance out of PPO when fine-tuning LLMs.
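One way a pairwise, reward-free objective can behave unexpectedly is that its loss depends only on the responses that appear in the preference data. The toy example below illustrates this with three candidate responses, of which only two appear in the preference pair. It is an illustrative construction with made-up probabilities and hypothetical names, not the analysis from the paper.

```python
import math


def dpo_pair_loss(policy, reference, chosen, rejected, beta=0.1):
    """Standard DPO loss for one pair, with policies given as probability dicts."""
    margin = beta * (
        (math.log(policy[chosen]) - math.log(reference[chosen]))
        - (math.log(policy[rejected]) - math.log(reference[rejected]))
    )
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Made-up distributions over three responses to a single prompt.
# "y_unseen" never appears in the preference data.
reference = {"y_good": 0.40, "y_bad": 0.40, "y_unseen": 0.20}
policy_a = {"y_good": 0.72, "y_bad": 0.08, "y_unseen": 0.20}
policy_b = {"y_good": 0.09, "y_bad": 0.01, "y_unseen": 0.90}

# Both policies prefer y_good over y_bad by the same 9:1 ratio.
loss_a = dpo_pair_loss(policy_a, reference, "y_good", "y_bad")
loss_b = dpo_pair_loss(policy_b, reference, "y_good", "y_bad")

if __name__ == "__main__":
    print(loss_a, loss_b)
```

The two losses agree up to floating-point rounding even though `policy_b` places 90% of its mass on a response the preference data says nothing about. On this single pair, the loss alone cannot distinguish the two policies.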

Furthermore, the paper examines practical considerations in applying these techniques, in particular the implementation and training choices that determine how well PPO performs when fine-tuning LLMs.
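As an example of the kind of implementation detail that can matter in PPO training, the sketch below shows advantage normalization, a widely used PPO stabilization step that rescales a batch of advantage estimates to zero mean and unit variance before the policy update. It is a generic illustration: the function name and the `eps` default are hypothetical, and whether it matches the authors' exact setup is an assumption.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and (roughly) unit variance.

    eps guards against division by zero when all advantages are equal.
    """
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


if __name__ == "__main__":
    batch = [1.0, 2.0, 3.0, 4.0]
    print(normalize_advantages(batch))
```

Normalizing within a batch keeps the scale of the policy gradient comparable across updates, which is one reason the step interacts with other settings such as batch size.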

Critical Analysis

The paper provides a comprehensive and thoughtful analysis of the performance of PPO and DPO for LLM alignment, highlighting the nuanced trade-offs between the two methods. However, the researchers also acknowledge several limitations and areas for further exploration.

One potential concern is DPO's reliance on a fixed set of preference data. The choice of that data can have a significant impact on the model's behavior, and it may be challenging to ensure that it is truly representative of human preferences. How sensitive DPO is to the size and diversity of the preference dataset could be an important area for future research.

Another limitation is the focus on a relatively narrow set of tasks and evaluation metrics. While the researchers have made a commendable effort to assess the performance of PPO and DPO across a range of domains, it is possible that there are other relevant tasks or metrics that could provide additional insights into the relative strengths and weaknesses of these approaches.

Furthermore, the paper does not delve into the ethical implications of LLM alignment, such as the potential for these techniques to be used to manipulate or control human behavior. Careful consideration of the societal impact of AI alignment methods is essential.

Overall, the paper represents a valuable contribution to the ongoing research on aligning large language models with human preferences. The insights provided can help inform the development of more effective and ethical AI alignment techniques, which will be crucial as these powerful models become increasingly prevalent in our lives.

Conclusion

This paper presents a comprehensive evaluation of two RLHF methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), for aligning large language models (LLMs) with human preferences. The researchers argue that DPO may have fundamental limitations, and report that PPO surpasses other alignment methods in all of their experiments and achieves state-of-the-art results in challenging code competitions.

The paper provides detailed technical explanations of the two methods, as well as a critical analysis of their strengths, weaknesses, and areas for further research. The findings offer valuable insights that can inform the development of more effective AI alignment techniques, which will be essential for ensuring that advanced AI systems behave in ways that are beneficial to humanity.

While the paper represents a significant contribution to the field, further exploration is still needed of the ethical implications of LLM alignment and of how sensitive these approaches are to the choice of reference policy and preference data. Addressing these challenges will be crucial as the field of AI alignment continues to evolve.

Related Papers

DPO Meets PPO: Reinforced Token Optimization for RLHF

Han Zhong, Guhao Feng, Wei Xiong, Li Zhao, Di He, Jiang Bian, Liwei Wang

In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed as Reinforced Token Optimization (RTO), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, RTO innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.

4/30/2024

cs.LG cs.AI cs.CL stat.ML

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Amir Saeidi, Shivanshu Verma, Chitta Baral

Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.

4/24/2024

cs.CL

Learn Your Reference Model for Real Good Alignment

Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov

The complexity of the alignment problem stems from the fact that existing methods are unstable. Researchers continuously invent various tricks to address this shortcoming. For instance, in the fundamental Reinforcement Learning From Human Feedback (RLHF) technique of Language Model alignment, in addition to reward maximization, the Kullback-Leibler divergence between the trainable policy and the SFT policy is minimized. This addition prevents the model from being overfitted to the Reward Model (RM) and generating texts that are out-of-domain for the RM. The Direct Preference Optimization (DPO) method reformulates the optimization task of RLHF and eliminates the Reward Model while tacitly maintaining the requirement for the policy to be close to the SFT policy. In our paper, we argue that this implicit limitation in the DPO method leads to sub-optimal results. We propose a new method called Trust Region DPO (TR-DPO), which updates the reference policy during training. With such a straightforward update, we demonstrate the effectiveness of TR-DPO against DPO on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by up to 19%, measured by automatic evaluation with GPT-4. The new alignment approach that we propose allows us to improve the quality of models across several parameters at once, such as coherence, correctness, level of detail, helpfulness, and harmlessness.

4/16/2024

cs.LG cs.CL

Token-level Direct Preference Optimization

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

4/19/2024

cs.CL cs.AI