Aligning Crowd Feedback via Distributional Preference Reward Modeling

Read original: arXiv:2402.09764 - Published 5/31/2024 by Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu

Overview

Deep Reinforcement Learning is widely used to align Large Language Models (LLMs) with human preferences.
Conventional reward modeling relies heavily on human annotations provided by a limited group of individuals.
This can lead to biased models that don't adequately represent the broader population's expectations.
The paper proposes the Distributional Preference Reward Model (DPRM) to better align LLMs with diverse human preferences.

Plain English Explanation

The paper addresses a key challenge in using Deep Reinforcement Learning to align Large Language Models (LLMs) with human preferences. Typically, this process relies on human-provided annotations to define the desired behavior, but the researchers argue that this can result in models that are skewed towards the preferences of the annotators rather than the wider population.

To address this, the paper introduces the Distributional Preference Reward Model (DPRM), a framework that characterizes multiple preferences as a categorical distribution. A Bayesian updater then revises that distribution as preferences shift or new ones appear, so the model is not locked into the initial set of annotations. The researchers also designed an optimal-transportation-based loss function to calibrate the DPRM and align it with the preference distribution.
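One way to picture a categorical preference distribution that absorbs new feedback over time is a Dirichlet-categorical update, in which each new vote is added to a running pseudo-count. The summary does not say what form the paper's update takes, so the Dirichlet prior, the three category names, and the function name `update_preference_distribution` below are all assumptions made for illustration.

```python
from typing import List


def update_preference_distribution(prior_counts: List[float],
                                   new_votes: List[int]) -> List[float]:
    """Return the posterior mean of a categorical preference distribution.

    Assumes a Dirichlet prior expressed as pseudo-counts (an illustrative
    choice, not taken from the paper). New votes are added to the
    pseudo-counts and the result is normalized to sum to 1.
    """
    if len(prior_counts) != len(new_votes):
        raise ValueError("prior_counts and new_votes must have the same length")
    posterior_counts = [a + n for a, n in zip(prior_counts, new_votes)]
    total = sum(posterior_counts)
    return [c / total for c in posterior_counts]


# Hypothetical categories: "prefers response A", "no preference", "prefers response B".
prior = [1.0, 1.0, 1.0]   # uniform prior: one pseudo-vote per category
votes = [6, 1, 0]         # newly collected crowd votes
posterior = update_preference_distribution(prior, votes)
print(posterior)          # [0.7, 0.2, 0.1]
```

Calling the function again with the posterior pseudo-counts and a fresh batch of votes would shift the distribution further, which is the sense in which such a model is not fixed to its first round of annotations.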

By using the expected reward from the DPRM, the researchers were able to fine-tune an LLM policy to generate responses that are more accurate, unbiased, and contextually appropriate for the broader population, rather than just a select group of annotators.

Technical Explanation

The paper proposes the Distributional Preference Reward Model (DPRM), a framework to better align Large Language Models (LLMs) with diverse human preferences. Conventional reward modeling for aligning language models with human preferences relies heavily on human annotations provided by a limited group of individuals, which can lead to biased models that fail to adequately represent the wider population's expectations.

To address this, the researchers characterize multiple preferences as a categorical distribution and introduce a Bayesian updater to accommodate shifted or new preferences over time. They also design an optimal-transportation-based loss function to calibrate the DPRM and align it with the preference distribution.
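To give a feel for what an optimal-transportation-based loss measures, the sketch below computes the 1-D Wasserstein-1 distance between a predicted and a target categorical distribution. It assumes the preference categories are ordered and one unit apart, in which case the distance reduces to the sum of absolute differences between the two cumulative distributions. The paper's actual cost function and loss are not given in this summary, so this is an illustrative stand-in rather than the authors' formulation, and `wasserstein_1d` is a hypothetical name.

```python
from itertools import accumulate
from typing import Sequence


def wasserstein_1d(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Wasserstein-1 distance between two categorical distributions.

    Assumes ordered categories with unit spacing (an assumption for this
    sketch). Under that assumption the optimal transport cost equals the
    sum of absolute differences between the cumulative distributions.
    """
    if len(predicted) != len(target):
        raise ValueError("distributions must have the same number of categories")
    cdf_pred = list(accumulate(predicted))
    cdf_target = list(accumulate(target))
    # The last CDF entries are both 1 for valid distributions, so skip them.
    return sum(abs(p - t) for p, t in zip(cdf_pred[:-1], cdf_target[:-1]))


target = [0.7, 0.2, 0.1]       # crowd preference distribution
close_guess = [0.6, 0.3, 0.1]  # small amount of mass one category away
far_guess = [0.1, 0.2, 0.7]    # most mass two categories away

print(wasserstein_1d(close_guess, target))
print(wasserstein_1d(far_guess, target))
```

Unlike a cross-entropy loss, which treats categories as unordered, a transport-based distance penalizes a prediction more when its probability mass sits further from where the target places it.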

Finally, the expected reward from the DPRM is used to fine-tune an LLM policy, enabling the generation of responses that are more accurate, unbiased, and contextually appropriate for the broader population, as demonstrated in the experiments.
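The expected reward mentioned above can be read as a probability-weighted average over preference categories. The following minimal sketch assumes each category carries a fixed scalar reward; those reward values, the example distributions, and the function name `expected_reward` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def expected_reward(probs: Sequence[float], category_rewards: Sequence[float]) -> float:
    """Collapse a categorical preference distribution into one scalar reward."""
    if len(probs) != len(category_rewards):
        raise ValueError("probs and category_rewards must have the same length")
    return sum(p * r for p, r in zip(probs, category_rewards))


# Hypothetical per-category rewards: liked, neutral, disliked.
rewards = [1.0, 0.0, -1.0]

# Hypothetical predicted preference distributions for two candidate responses.
candidates: Dict[str, Sequence[float]] = {
    "response_a": [0.7, 0.2, 0.1],
    "response_b": [0.3, 0.3, 0.4],
}

scores = {name: expected_reward(p, rewards) for name, p in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # response_a
```

In a reinforcement-learning fine-tuning loop, a scalar of this kind would play the role of the reward signal for each sampled response.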

Critical Analysis

The paper presents a novel approach to aligning Large Language Models (LLMs) with human preferences, addressing an important limitation of conventional reward modeling techniques. By characterizing preferences as a categorical distribution and introducing a Bayesian updater, the Distributional Preference Reward Model (DPRM) can more effectively accommodate shifting or new preferences over time, reducing the risk of model bias.

However, the paper does not discuss the potential challenges in obtaining a representative and diverse set of human preferences to train the DPRM. There may be practical difficulties in ensuring that the preference distribution accurately reflects the views of the broader population, especially for complex or sensitive topics.

Additionally, the paper does not explore the computational complexity and scalability of the DPRM framework, which could be a concern when applying it to large-scale LLMs. Further research may be needed to investigate the trade-offs between the model's complexity, training efficiency, and alignment accuracy.

Conclusion

The Distributional Preference Reward Model (DPRM) proposed in this paper represents a promising approach to aligning large language models with human preferences in a more inclusive and representative manner. By characterizing preferences as a categorical distribution and calibrating the reward model with an optimal-transportation-based loss, the framework supplies an expected reward for fine-tuning an LLM policy, whose responses the authors report to be more accurate, unbiased, and contextually appropriate for the broader population.

This research contributes to the ongoing efforts to develop AI systems that are better aligned with human values and expectations, which is crucial for the safe and responsible deployment of large language models in real-world applications.

Related Papers

Aligning Crowd Feedback via Distributional Preference Reward Modeling

Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu

Deep Reinforcement Learning is widely used for aligning Large Language Models (LLM) with human preference. However, the conventional reward modelling is predominantly dependent on human annotations provided by a select cohort of individuals. Such dependence may unintentionally result in skewed models that reflect the inclinations of these annotators, thereby failing to adequately represent the wider population's expectations. We propose the Distributional Preference Reward Model (DPRM), a simple yet effective framework to align large language models with diverse human preferences. To this end, we characterize multiple preferences by a categorical distribution and introduce a Bayesian updater to accommodate shifted or new preferences. On top of that, we design an optimal-transportation-based loss to calibrate DPRM to align with the preference distribution. Finally, the expected reward is utilized to fine-tune an LLM policy to generate responses favoured by the population. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.

5/31/2024

Robust Preference Optimization through Reward Model Distillation

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.

5/30/2024

Privately Aligning Language Models with Reinforcement Learning

Fan Wu, Huseyin A. Inan, Arturs Backurs, Varun Chandrasekaran, Janardhan Kulkarni, Robert Sim

Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.

5/6/2024

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

7/31/2024
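
The "simple classification loss" that the Direct Preference Optimization abstract refers to can be sketched for a single preference pair as follows. The code uses the standard DPO objective, the negative log-sigmoid of a scaled difference of policy-to-reference log-ratios. The function name, the argument names, and the default `beta=0.1` are illustrative choices, and the log-probabilities in the usage lines are made-up numbers rather than model outputs.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -16.0, -12.0, -15.0))
```

Because the loss depends only on log-probabilities of fixed responses, it can be minimized with ordinary supervised training, with no sampling from the language model during fine-tuning.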