Robust Preference Optimization with Provable Noise Tolerance for LLMs

2404.04102

Published 5/29/2024 by Xize Liang, Chao Chen, Shuang Qiu, Jie Wang, Yue Wu, Zhihang Fu, Zhihao Shi, Feng Wu, Jieping Ye

Abstract

Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts for this problem either marginally alleviate the impact of noise without the ability to actually reduce its presence, or rely on costly teacher LLMs prone to reward misgeneralization. To address these challenges, we propose the RObust Preference Optimization (ROPO) framework, an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models. Specifically, ROPO iteratively solves a constrained optimization problem, where we dynamically assign a quality-aware weight for each sample and constrain the sum of the weights to the number of samples we intend to retain. For noise-tolerant training and effective noise identification, we derive a robust loss by suppressing the gradients of samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is critical for distinguishing noisy samples from clean ones. Furthermore, inspired by our derived loss, we propose a robustness-guided rejection sampling technique to compensate for the potential important information in discarded queries. Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods, with its superiority growing as the noise rate increases.

Overview

This paper proposes a framework for aligning large language models (LLMs) with preference data that stays effective when the data is noisy, with provable noise tolerance.
The method, called RObust Preference Optimization (ROPO), is an iterative alignment approach that combines noise-tolerant training with filtering of noisy samples, without relying on external models.
ROPO assigns each sample a quality-aware weight, trains with a robust loss that suppresses the gradients of high-uncertainty samples, and uses robustness-guided rejection sampling to recover information from discarded queries.

Plain English Explanation

The paper discusses a new way to train large language models (LLMs) to learn preferences and make decisions in a more robust and reliable way. LLMs are powerful AI systems that can generate human-like text, but they can sometimes be influenced by biased or inconsistent feedback from humans during the training process.

The researchers developed a method called RObust Preference Optimization (ROPO) that helps LLMs learn from preference data even when some of the labels are noisy. Instead of trusting every example equally, ROPO repeatedly weighs each example by its apparent quality, keeps only as many examples as it intends to retain, and trains with a loss that limits how strongly the most uncertain examples can pull on the model. It does this without the help of an external teacher model.

This is important because it can help make LLMs more reliable and aligned with human values, which is crucial as these models become more powerful and influential. By training LLMs to learn preferences in a more robust way, the researchers aim to create AI systems that are less prone to biases or erratic behavior, and more likely to make decisions that are consistent with what humans actually want.

Technical Explanation

The paper introduces RObust Preference Optimization (ROPO), an iterative alignment framework that integrates noise-tolerant training with the filtering of noisy samples, without the aid of external models. ROPO iteratively solves a constrained optimization problem in which each sample receives a dynamically assigned, quality-aware weight, and the weights are constrained to sum to the number of samples to be retained.
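The abstract describes weights that are constrained to sum to the number of samples to retain. A minimal sketch of that constraint follows. It assumes a per-sample loss serves as the quality signal and that the weights are binary; neither detail is confirmed by the abstract, and `retention_weights` is a hypothetical helper name. Under those assumptions, minimizing the weighted loss subject to the sum constraint reduces to keeping the lowest-loss samples.

```python
def retention_weights(losses, num_keep):
    """Return 0/1 weights that sum to num_keep and minimize sum(w_i * loss_i).

    Assumption (not from the paper): per-sample loss is the quality proxy,
    so the minimizer simply keeps the num_keep samples with the lowest loss.
    """
    if not 0 <= num_keep <= len(losses):
        raise ValueError("num_keep must be between 0 and len(losses)")
    # Indices sorted from lowest loss (most trusted) to highest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    keep = set(order[:num_keep])
    return [1.0 if i in keep else 0.0 for i in range(len(losses))]


# Hypothetical per-sample losses for five preference pairs.
losses = [0.2, 1.7, 0.4, 2.3, 0.1]
weights = retention_weights(losses, num_keep=3)
print(weights)  # [1.0, 0.0, 1.0, 0.0, 1.0]
```

In an iterative scheme, a selection of this kind would be recomputed as the model, and therefore the per-sample losses, change during training.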

For noise-tolerant training and effective noise identification, the authors derive a robust loss that suppresses the gradients of samples with high uncertainty. They argue, both empirically and theoretically, that this loss is critical for distinguishing noisy samples from clean ones. Building on the same loss, they propose a robustness-guided rejection sampling technique that compensates for potentially important information in the discarded queries.
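The abstract says the robust loss works by suppressing the gradients of high-uncertainty samples. The sketch below illustrates that general idea with a generic bounded loss; it is not the paper's derived loss, whose exact form the abstract does not give. It compares the gradient magnitude of a DPO-style logistic loss, -log σ(m), with that of a bounded sigmoid loss, σ(-m), where m is a hypothetical preference margin (positive when the model agrees with the label). Treating a strongly negative margin as the sign of a suspect label is an assumption of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_scale(margin):
    """Magnitude of d/dm of -log(sigmoid(m)); approaches 1 as m -> -inf."""
    return sigmoid(-margin)


def bounded_grad_scale(margin):
    """Magnitude of d/dm of sigmoid(-m); peaks at m = 0 and vanishes as |m| grows."""
    return sigmoid(margin) * sigmoid(-margin)


# Compare how strongly each loss pushes on a pair, for a few margins.
for margin in (-4.0, 0.0, 4.0):
    print(
        f"margin={margin:+.1f}  "
        f"logistic={logistic_grad_scale(margin):.4f}  "
        f"bounded={bounded_grad_scale(margin):.4f}"
    )
```

For a pair whose margin strongly contradicts its label, the logistic loss pushes almost at full strength while the bounded loss barely pushes at all, so a possibly mislabeled pair has less influence on training.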

Experiments on three widely used datasets with Mistral-7B and Llama-2-7B show that ROPO significantly outperforms existing preference alignment methods, and that its advantage grows as the noise rate increases.

Critical Analysis

The paper presents a compelling approach for making the preference alignment of large language models more robust to noisy data. Its main contributions are the weighted constrained-optimization formulation, the robust loss with its accompanying theoretical analysis, and the rejection sampling step. Together they address both tolerating noise and reducing its presence, which the abstract identifies as a gap in prior work.

However, some questions remain open. The experiments described in the abstract cover three datasets and two 7B-parameter models (Mistral-7B and Llama-2-7B), so the abstract does not establish how the gains carry over to larger models or other data sources. The framework also requires choosing how many samples to retain, and the abstract does not say how sensitive the results are to that choice.

Additionally, the paper focuses on robustness to noisy preference data, but in many real-world scenarios LLMs need to balance multiple objectives, such as task performance, safety, and alignment with human values. Integrating ROPO into a broader framework for multi-objective optimization in LLMs could be an interesting direction for future work.

Overall, the paper presents a solid contribution to the field of AI safety and alignment, and the Robust Preference Optimization method is a promising step towards building more trustworthy and reliable large language models.

Conclusion

This paper introduces RObust Preference Optimization (ROPO), an iterative framework that aims to make the preference alignment of large language models (LLMs) robust to noisy preference data. By weighting samples according to their quality, filtering out likely-noisy ones, and training with a loss that suppresses the gradients of high-uncertainty samples, ROPO combines noise tolerance with noise filtering and needs no external models.

The theoretical analysis and experimental results presented in the paper support the effectiveness of ROPO, suggesting that it could be a valuable tool for building more trustworthy and aligned LLMs. As these models become increasingly influential, developing methods like ROPO that keep alignment reliable under imperfect feedback is crucial for their safe and responsible deployment.

While there is room for further research, the Robust Preference Optimization approach represents an important step forward in the field of AI safety and alignment, with the potential to significantly improve the robustness and trustworthiness of large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Provably Robust DPO: Aligning Language Models with Noisy Feedback

Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function, which de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.

4/15/2024

cs.LG cs.CL

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Through empirical tests, including dialogue and summarization tasks, and evaluations using the AlpacaEval2.0 leaderboard, RPO has demonstrated a superior ability to align LLMs with user preferences and to improve their adaptability during the training process. Our code can be viewed at https://github.com/yinyueqin/relative-preference-optimization

5/29/2024

cs.CL cs.AI cs.LG

LiPO: Listwise Preference Optimization through Learning-to-Rank

Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang

Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt. Multiple responses can also be ranked by reward models or AI feedback. There lacks such a thorough study on directly fitting upon a list of responses. In this work, we formulate the LM alignment as a listwise ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment with DPO and SLiC as special cases when list size is two. In particular, we highlight a specific method, LiPO-$\lambda$, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-$\lambda$ can outperform DPO variants and SLiC by a clear margin on several preference alignment tasks with both curated and real rankwise preference data.

5/24/2024

cs.CL cs.LG

Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization

Yi Gu, Zhendong Wang, Yueqin Yin, Yujia Xie, Mingyuan Zhou

Aligning large language models with human preferences has emerged as a critical focus in language modeling research. Yet, integrating preference learning into Text-to-Image (T2I) generative models is still relatively uncharted territory. The Diffusion-DPO technique made initial strides by employing pairwise preference learning in diffusion models tailored for specific text prompts. We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively. This approach leverages both prompt-image pairs with identical prompts and those with semantically related content across various modalities. Furthermore, we have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high costs, low reproducibility, and limited interpretability prevalent in current evaluations of human preference alignment. Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0, achieving superior results in both automated evaluations of human preferences and style alignment. Our code is available at https://github.com/yigu1008/Diffusion-RPO

6/11/2024

cs.CV cs.CL cs.LG