Averaging log-likelihoods in direct alignment

2406.19188

YC

0

Reddit

0

Published 6/28/2024 by Nathan Grinsztajn, Yannis Flet-Berliac, Mohammad Gheshlaghi Azar, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Olivier Pietquin and 1 other
Averaging log-likelihoods in direct alignment

Abstract

To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involving the log-likelihood of (dis)preferred completions according to the trained model. However, completions have various lengths, and the log-likelihood is not length-invariant. On the other side, the cross-entropy loss used in supervised training is length-invariant, as batches are typically averaged token-wise. To reconcile these approaches, we introduce a principled approach for making direct alignment length-invariant. Formally, we introduce a new averaging operator, to be composed with the optimality operator giving the best policy for the underlying RL problem. It translates into averaging the log-likelihood within the loss. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.

Create account to get full access

or

If you already have an account, we'll log you in

Overview

ā€¢ This paper explores the use of averaging log-likelihoods as a technique for directly aligning large language models (LLMs) with desired objectives.

ā€¢ The researchers investigate the effectiveness of this approach compared to other methods like Reinforcement Learning from Human Feedback (RLHF) and Contrastive Policy Gradient.

ā€¢ The paper also examines the potential for reward model overoptimization, a known issue in direct alignment approaches, and proposes ways to address it.

Plain English Explanation

The paper focuses on a technique called "averaging log-likelihoods" for aligning large language models (LLMs) with desired objectives. This means finding ways to make the models behave in a way that matches our goals, rather than just optimizing for maximum likelihood on the training data.

The researchers compare this approach to other methods like RLHF and Contrastive Policy Gradient, which also aim to align LLMs. They look at how well each method works and whether they might run into issues like "reward model overoptimization," where the model becomes too focused on maximizing the reward signal rather than behaving as intended.

Technical Explanation

The paper explores the use of "averaging log-likelihoods" as a technique for directly aligning large language models (LLMs) with desired objectives. This involves modifying the training process to optimize not just for the standard language modeling objective, but also for an additional term that encourages the model's outputs to match a target distribution.

The researchers compare this approach to other direct alignment methods, such as RLHF and Contrastive Policy Gradient. They analyze the performance of each method on various tasks and investigate the potential for reward model overoptimization, a known issue in direct alignment approaches.

To address reward model overoptimization, the paper proposes techniques like offline regularized reinforcement learning and explores the role of scaling laws in this context.

Critical Analysis

The paper provides a thorough exploration of the averaging log-likelihoods approach and its comparison to other direct alignment methods. The researchers acknowledge the potential for reward model overoptimization and offer strategies to mitigate this issue, which is a valuable contribution to the field.

However, the paper does not delve into the potential limitations or downsides of the averaging log-likelihoods technique itself. It would be helpful to understand any challenges or drawbacks that may arise from this approach, such as its scalability, computational complexity, or potential negative impacts on model performance or safety.

Additionally, the paper could benefit from a more in-depth discussion of the broader implications and societal considerations of aligning large language models with specific objectives. As these models become increasingly powerful and influential, it is crucial to consider the ethical and responsible development of such technologies.

Conclusion

This paper presents a novel technique for directly aligning large language models with desired objectives, called "averaging log-likelihoods." The researchers compare this approach to other methods, such as RLHF and Contrastive Policy Gradient, and explore ways to address the issue of reward model overoptimization.

The findings contribute to the ongoing efforts to develop more effective and responsible techniques for aligning large language models with societal and ethical objectives. While the paper provides a thorough technical analysis, further exploration of the limitations and broader implications of this approach could strengthen the overall contribution to the field.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Aligning Large Language Models via Fine-grained Supervision

Aligning Large Language Models via Fine-grained Supervision

Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak, Jaeyoung Do

YC

0

Reddit

0

Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of $5.1%$ in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.

Read more

6/6/2024

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su

YC

0

Reddit

0

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.

Read more

5/28/2024

Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion

Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion

Yannis Flet-Berliac, Nathan Grinsztajn, Florian Strub, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Mohammad Gheshlaghi Azar, Olivier Pietquin, Matthieu Geist

YC

0

Reddit

0

Reinforcement Learning (RL) has been used to finetune Large Language Models (LLMs) using a reward model trained from preference data, to better align with human judgment. The recently introduced direct alignment methods, which are often simpler, more stable, and computationally lighter, can more directly achieve this. However, these approaches cannot optimize arbitrary rewards, and the preference-based ones are not the only rewards of interest for LLMs (eg., unit tests for code generation or textual entailment for summarization, among others). RL-finetuning is usually done with a variation of policy gradient, which calls for on-policy or near-on-policy samples, requiring costly generations. We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data. It can be seen as an off-policy policy gradient approach that does not rely on important sampling techniques and highlights the importance of using (the right) state baseline. We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient. We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task, using a learned reward function considered as ground truth for the purpose of the experiments.

Read more

6/28/2024

šŸ…

Offline Regularised Reinforcement Learning for Large Language Models Alignment

Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot

YC

0

Reddit

0

The dominant framework for alignment of large language models (LLM), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two independent responses, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, emph{single-trajectory} datasets where each element is a triplet composed of a prompt, a response and a human feedback is naturally more abundant. The canonical element of such datasets is for instance an LLM's response to a user's prompt followed by a user's feedback such as a thumbs-up/down. Consequently, in this work, we propose DRO, or emph{Direct Reward Optimisation}, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.

Read more

5/30/2024