Direct Multi-Turn Preference Optimization for Language Agents

arXiv:2406.14868

Published 6/24/2024 by Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng

Abstract

Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.

Overview

Adapting large language models (LLMs) for agent tasks is crucial for developing effective language agents.
Direct Preference Optimization (DPO) is a promising technique for this adaptation, as it can help alleviate compounding errors and directly optimize Reinforcement Learning (RL) objectives.
Applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function.
The paper introduces a novel loss function called DMPO that addresses these challenges by replacing the policy constraint with the state-action occupancy measure constraint and adding length normalization to the Bradley-Terry model.

Plain English Explanation

Large language models (LLMs) are powerful tools that can be used to create intelligent language agents, such as virtual assistants or chatbots. However, adapting an LLM by simply imitating expert demonstrations is prone to compounding errors: a small early mistake pushes the agent into situations it never saw during training, and later mistakes build on it. Direct Preference Optimization (DPO) is a technique that aims to alleviate this by directly optimizing the Reinforcement Learning (RL) objective from pairs of preferred and dis-preferred behavior.

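Multi-turn tasks, where the agent takes a sequence of actions and receives feedback after each one, pose additional challenges for DPO. In standard DPO, the partition function (the term that normalizes the probabilities in the RL objective) cancels out when two responses to the same prompt are compared. In a multi-turn task it depends on the current state, and two trajectories generally pass through different states, so it no longer cancels. This makes it difficult to optimize the objective directly.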

To overcome this obstacle, the researchers propose a novel loss function called DMPO (Direct Multi-turn Preference Optimization). DMPO does two key things:

It replaces the policy constraint with a constraint on the state-action occupancy measure. This makes the partition function independent of the current state, making it easier to optimize.
It adds length normalization to the Bradley-Terry model, which is used to compare the preferences of different conversational trajectories. This helps address the issue of length disparities between preferred and dis-preferred trajectories.

With these changes, the loss function can be used to optimize LLMs on multi-turn agent tasks. Experiments on three multi-turn agent task datasets confirmed the effectiveness of the DMPO loss function compared to other approaches.

Technical Explanation

The paper proposes a novel loss function, called DMPO (Direct Multi-turn Preference Optimization), to adapt large language models (LLMs) for multi-turn agent tasks. It extends the Direct Preference Optimization (DPO) technique, whose derivation assumes a single-turn setting in which two responses to the same prompt are compared.

The key challenges in applying DPO to multi-turn tasks are:

The inability to cancel the partition function in the Reinforcement Learning (RL) objective, which is necessary for direct optimization.
Length disparities between preferred and dis-preferred trajectories in multi-turn conversations.
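To make the first challenge concrete, the following is a sketch in the standard single-turn DPO notation. The symbols $x$ (prompt), $y_w$ and $y_l$ (preferred and dis-preferred responses), $r$ (reward), $\beta$, $\pi_{\mathrm{ref}}$, and $Z$ are assumed here for illustration and are not defined in this summary.

```latex
% Single-turn DPO: the KL-constrained optimum has a prompt-dependent normalizer Z(x).
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
                  \exp\!\left(\frac{r(x,y)}{\beta}\right)
\;\Longrightarrow\;
r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry compares two responses to the SAME prompt x,
% so the \beta \log Z(x) terms cancel in the reward difference:
p(y_w \succ y_l \mid x)
  = \sigma\big(r(x,y_w) - r(x,y_l)\big)
  = \sigma\!\left(
      \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
```

In a multi-turn task the analogous normalizer is attached to each state $s_t$ rather than to one shared prompt. Because the preferred and dis-preferred trajectories visit different states once they diverge, the per-state $\log Z(s_t)$ terms do not cancel in the difference, which is the obstacle described above.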

To address these challenges, the DMPO loss function makes two key changes:

It replaces the policy constraint with a constraint on the state-action occupancy measure. This makes the partition function independent of the current state, allowing for direct optimization.
It adds length normalization to the Bradley-Terry model used to compare preferred and dis-preferred trajectories. This helps address the issue of length disparities.
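The effect of length normalization can be illustrated with a minimal sketch. It assumes the per-turn log-ratios $\log \pi_\theta(a_t \mid s_t) - \log \pi_{\mathrm{ref}}(a_t \mid s_t)$ have already been computed, and it normalizes by a plain per-turn average. The function name, the averaging scheme, and the default `beta` are illustrative assumptions; the paper's exact weighting may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def length_normalized_preference_loss(win_log_ratios, lose_log_ratios, beta=0.1):
    """Bradley-Terry style loss on two trajectories of possibly different length.

    Each argument is a list with one entry per turn, holding
    log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t) for that turn.
    Hypothetical simplification: each trajectory's score is the MEAN
    per-turn log-ratio, so a longer trajectory does not dominate the
    comparison merely by having more terms.
    """
    if not win_log_ratios or not lose_log_ratios:
        raise ValueError("both trajectories need at least one turn")
    win_score = beta * (sum(win_log_ratios) / len(win_log_ratios))
    lose_score = beta * (sum(lose_log_ratios) / len(lose_log_ratios))
    # Negative log-likelihood that the preferred trajectory wins.
    return -log_sigmoid(win_score - lose_score)


if __name__ == "__main__":
    preferred = [1.0, 1.0]            # 2 turns
    short_rejected = [-1.0] * 2       # 2 turns
    long_rejected = [-1.0] * 6        # 6 turns, same per-turn quality
    print(length_normalized_preference_loss(preferred, short_rejected))
    print(length_normalized_preference_loss(preferred, long_rejected))
```

With the averaging used here, the two printed losses are equal: the dis-preferred trajectory's length alone does not change the margin. Summing the log-ratios without normalization would instead make the margin grow with the number of turns.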

The paper provides detailed theoretical explanations for these changes and how they overcome the challenges of applying DPO to multi-turn tasks.
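As a rough guide to the first change, the sketch below uses the standard definition of the discounted state-action occupancy measure. The notation ($d^{\pi}$, $\gamma$, $r$, $\beta$, $Z$) and the choice of KL as the divergence are assumptions made for illustration, and the closed form treats the occupancy measure as a free distribution, glossing over the environment's transition constraints.

```latex
% Discounted state-action occupancy measure of a policy \pi (standard definition):
d^{\pi}(s,a) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s,\; a_t = a \mid \pi)

% Occupancy-constrained objective: one divergence over (s,a) pairs
% instead of a separate policy divergence at every state.
\max_{\pi}\;
  \mathbb{E}_{(s,a)\sim d^{\pi}}\big[\, r(s,a) \,\big]
  - \beta\, D_{\mathrm{KL}}\!\big[\, d^{\pi}(s,a) \,\big\|\, d^{\pi_{\mathrm{ref}}}(s,a) \,\big]

% Its optimum is normalized by a single constant Z, not a function of the state s:
d^{\pi^*}(s,a) = \frac{1}{Z}\; d^{\pi_{\mathrm{ref}}}(s,a)\,
                 \exp\!\left(\frac{r(s,a)}{\beta}\right)
```

Because $Z$ here is one constant shared by every state-action pair, it appears identically in the scores of the preferred and dis-preferred trajectories and can be removed from the comparison, provided the two scores are placed on the same scale with respect to trajectory length.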

The effectiveness of DMPO is demonstrated through extensive experiments on three multi-turn agent task datasets. The results show that DMPO outperforms the other approaches it is compared against, supporting its use for adapting LLMs to multi-turn agent tasks.

Critical Analysis

The paper presents a compelling solution to the challenges of applying DPO to multi-turn agent tasks. The DMPO loss function effectively addresses the issues of the partition function and length disparities, making it a promising approach for adapting LLMs to these types of tasks.

However, the paper does not discuss potential limitations or areas for further research. For example, it would be valuable to understand how DMPO performs on a wider range of multi-turn agent tasks, including more complex or open-ended conversations. Additionally, the paper does not explore the computational efficiency of the DMPO approach compared to other methods, which could be an important consideration for real-world applications.

Furthermore, while the theoretical explanations provided in the paper are detailed and well-written, it would be helpful to see a more intuitive explanation of the core ideas behind DMPO. This could make the research more accessible to a broader audience and encourage further exploration and discussion of the approach.

Overall, the DMPO loss function represents an important step forward in adapting LLMs for multi-turn agent tasks. However, further research and analysis could help strengthen the impact and adoption of this promising technique.

Conclusion

The paper introduces a novel loss function called DMPO (Direct Multi-turn Preference Optimization) that enables the effective adaptation of large language models (LLMs) for multi-turn agent tasks. By addressing the key challenges of the partition function and length disparities, DMPO provides a solution that outperforms other approaches in extensive experiments.

The DMPO loss function represents a significant advancement in the field of language agent development, as it allows researchers to directly optimize Reinforcement Learning objectives for these complex, multi-turn tasks. This has the potential to lead to more capable and reliable virtual assistants, chatbots, and other language-based agents that can engage in natural, contextual conversations.

While the paper provides a strong technical foundation for DMPO, further research and analysis could help expand its reach and impact. Exploring the limitations, computational efficiency, and more intuitive explanations of the approach could make it more accessible and applicable to a wider range of real-world scenarios.

Overall, the DMPO loss function is a promising contribution to the ongoing efforts to adapt powerful LLMs for language agent tasks, with the potential to drive significant advancements in this rapidly evolving field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.

6/18/2024

cs.CV cs.AI cs.CL cs.LG

Token-level Direct Preference Optimization

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

6/28/2024

cs.CL cs.AI

Direct Preference Optimization with an Offset

Afra Amini, Tim Vieira, Ryan Cotterell

Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal. Sometimes, the preferred response is only slightly better than the dispreferred one. In other cases, the preference is much stronger. For instance, if a response contains harmful or toxic content, the annotator will have a strong preference for that response. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited.

6/7/2024

cs.CL cs.AI cs.LG

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu

For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to tune language models to easily maximize non-differentiable and non-binary objectives according to the LLM designer's preferences (e.g., using simpler language or minimizing specific kinds of harmful content). These may neither align with user preferences nor even be able to be captured tractably by binary preference data. To leverage the simplicity and performance of DPO with the generalizability of RL, we propose a hybrid approach between DPO and RLHF. With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards using offline RL. The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives, while preserving alignment performance across a range of challenging benchmarks and model sizes.

5/31/2024

cs.AI