Token-level Direct Preference Optimization

2404.11999

Published 4/19/2024 by Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Token-level Direct Preference Optimization

Abstract

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper introduces a novel approach called Token-level Direct Preference Optimization (TDPO) for training large language models (LLMs) to align with human preferences.
TDPO directly optimizes the model's preferences at the token-level, rather than relying on intermediate reward functions or value models.
The authors demonstrate that TDPO outperforms existing alignment methods like Preference Learning and Debate on a range of alignment tasks.

Plain English Explanation

The paper presents a new way to train large language models (LLMs) so that they behave in alignment with human preferences. Instead of using intermediate steps like reward functions or value models, the authors' approach, called Token-level Direct Preference Optimization (TDPO), directly optimizes the model's preferences at the level of individual words or "tokens."

The key idea is that by directly optimizing the model's preferences, it can learn to generate text that is more closely aligned with what humans would prefer, without needing to go through additional modeling steps. The authors show that TDPO outperforms other existing alignment methods, like Preference Learning and Debate, on a variety of alignment tasks. This suggests that directly optimizing the model's preferences may be a more effective way to ensure LLMs behave in alignment with human values and preferences.

Technical Explanation

The paper proposes a new approach called Token-level Direct Preference Optimization (TDPO) for aligning large language models (LLMs) with human preferences. Unlike existing methods that rely on intermediate reward functions or value models, TDPO directly optimizes the model's preferences at the level of individual tokens.

The key technical insight is that by directly optimizing the model's preferences, it can learn to generate text that is more closely aligned with human preferences, without needing to go through additional modeling steps. The authors formulate the alignment problem as a constrained optimization problem, where the objective is to maximize the model's likelihood of generating text that is preferred by humans, subject to constraints on the model's capabilities.

To implement TDPO, the authors introduce a novel training procedure that involves iteratively refining the model's preferences based on human feedback. This feedback is used to update the model's parameters in a way that encourages it to generate text that is more aligned with human preferences.

The authors evaluate TDPO on a range of alignment tasks, including video generation, language modeling, and multi-modal reasoning. Their results demonstrate that TDPO outperforms existing alignment methods, such as Preference Learning and Debate, on these tasks.

Critical Analysis

The paper presents a promising new approach for aligning large language models with human preferences, but there are a few potential limitations and areas for further research:

The authors only evaluate TDPO on a limited set of tasks and datasets, and it's unclear how well the approach would generalize to a wider range of alignment problems. More extensive testing on diverse alignment tasks would be helpful to better understand the strengths and limitations of TDPO.
The paper does not provide a detailed analysis of the computational and sample complexity of TDPO, which could be an important consideration for real-world deployments of the method. Understanding the tradeoffs between alignment performance and efficiency would be valuable.
The authors do not discuss potential issues related to the interpretability and transparency of the TDPO-aligned models. As these models become more widely deployed, it will be important to address concerns around the explainability of their behaviors and decisions.
The paper does not explore the robustness of TDPO to distributional shift or adversarial attacks. Evaluating the provable robustness of TDPO-aligned models would be an important direction for future research.

Overall, the Token-level Direct Preference Optimization approach presented in this paper represents an interesting and promising step forward in the field of AI alignment. However, additional research is needed to fully understand the capabilities and limitations of this approach.

Conclusion

This paper introduces a novel method called Token-level Direct Preference Optimization (TDPO) for aligning large language models (LLMs) with human preferences. Unlike existing approaches that rely on intermediate reward functions or value models, TDPO directly optimizes the model's preferences at the level of individual tokens.

The authors' experiments show that TDPO outperforms other alignment methods on a range of tasks, suggesting that directly optimizing the model's preferences may be a more effective way to ensure LLMs behave in alignment with human values and preferences. However, the paper also highlights a few potential limitations and areas for further research, such as the need for more extensive evaluations, analysis of computational efficiency, and exploration of model interpretability and robustness.

As the development of powerful LLMs continues, techniques like TDPO will become increasingly important for ensuring these models are aligned with human preferences and values. The insights and findings presented in this paper represent an important contribution to the ongoing efforts to create safe and beneficial artificial intelligence systems.

Related Papers

Filtered Direct Preference Optimization

Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on Direct Preference Optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.

4/24/2024

cs.LG cs.AI cs.CL

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

Amir Saeidi, Shivanshu Verma, Chitta Baral

Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.

4/24/2024

cs.CL

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.

4/23/2024

cs.CL

From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

Reinforcement Learning From Human Feedback (RLHF) has been a critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference, first we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-tun dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.

4/19/2024

cs.LG