Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

2405.17931

Published 5/29/2024 by Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, Chang Zhou

Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

Abstract

Effectively aligning Large Language Models (LLMs) with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose integrating the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, introducing the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between SFT and pretrained models, effectively steering the gradient towards maximizing rewards in the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families, such as Qwen and LLaMA, across various model sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.

Create account to get full access

Overview

This paper explores online merging optimizers, which are techniques to combine multiple reward signals to boost performance and mitigate unintended consequences like "reward hacking" in AI alignment.
The key ideas are to use a weighted sum of rewards, with the weights adaptively updated based on the training progress, and to introduce a "tax" term to discourage undesirable behavior.
The authors show that their proposed methods can lead to improved performance and more stable and robust training compared to standard reward functions.

Plain English Explanation

The paper is about finding better ways to train AI systems to do what we want them to do. One challenge is that the AI may try to "hack the system" and find unexpected ways to maximize the rewards it's given, even if those actions don't actually achieve the intended goal. <a href="https://aimodels.fyi/papers/arxiv/learn-your-reference-model-real-good-alignment">This can lead to undesirable or even dangerous behavior</a>.

The key ideas in this paper are:

Merging Rewards: Instead of using a single reward signal, the AI can learn from a weighted combination of multiple reward signals. The weights of these different rewards are updated adaptively during training to get the best performance.
Introducing a "Tax": The authors also propose adding a "tax" term to the reward function. This tax discourages the AI from taking actions that might be undesirable, even if they momentarily increase the reward. <a href="https://aimodels.fyi/papers/arxiv/provably-mitigating-overoptimization-rlhf-your-sft-loss">The goal is to mitigate "reward hacking" and get the AI to do what we really want, not just maximize the numbers</a>.

By using these techniques, the authors show that the AI can achieve better performance on the intended task while also being more stable and robust during training. This is an important step towards developing AI systems that are well-aligned with human values and intentions.

Technical Explanation

The paper introduces two main techniques for training AI systems more effectively:

Online Merging of Rewards: Instead of using a single reward signal, the authors propose learning from a weighted combination of multiple reward signals. The weights of these different rewards are updated adaptively during training using an "online merging" algorithm. This allows the system to focus on the most relevant rewards as training progresses.

<a href="https://aimodels.fyi/papers/arxiv/getting-more-juice-out-sft-data-reward">The intuition is that different reward signals may capture different aspects of the desired behavior, and by combining them, the AI can learn a more comprehensive and robust policy</a>.

Introducing a "Tax" Term: The authors also propose adding a "tax" term to the reward function. This tax is designed to discourage the AI from taking actions that might be undesirable, even if they momentarily increase the reward. <a href="https://aimodels.fyi/papers/arxiv/algorithmic-bias-aligning-large-language-models-rlhf">The goal is to mitigate "reward hacking" and get the AI to do what we really want, not just maximize the numbers</a>.

The authors evaluate their proposed methods on several benchmark tasks and show that they can lead to improved performance and more stable and robust training compared to standard reward functions.

Critical Analysis

The paper presents a promising approach to addressing some of the key challenges in AI alignment, such as <a href="https://aimodels.fyi/papers/arxiv/real-better-aligning-large-language-models-online">reward hacking and the difficulty of specifying a single, comprehensive reward function</a>. The use of adaptive merging of rewards and the introduction of a "tax" term are interesting ideas that could help train AI systems to be more aligned with human values and intentions.

That said, the paper does not address some potential limitations and areas for further research. For example, the authors do not discuss how to choose the appropriate set of reward signals to merge, or how to ensure that the "tax" term is properly calibrated to discourage undesirable behavior without overly constraining the AI's exploration of the solution space.

Additionally, the paper focuses on relatively simple benchmark tasks, and it's unclear how well the proposed techniques would scale to more complex, real-world problems. Further research and validation on larger, more diverse datasets would be valuable to assess the broader applicability of these methods.

Overall, the paper presents a thought-provoking approach to AI alignment that merits further investigation and refinement. Continued research in this area is crucial for developing AI systems that are reliable, trustworthy, and well-aligned with human values.

Conclusion

This paper introduces two key techniques for training more effective and well-aligned AI systems: online merging of multiple reward signals and the introduction of a "tax" term to discourage undesirable behavior. The authors show that these methods can lead to improved performance and more stable and robust training compared to standard reward functions.

While the paper presents a promising approach to some of the challenges in AI alignment, it also highlights the need for further research to address potential limitations and scale the techniques to more complex, real-world problems. Continued advancements in this area are crucial for developing AI systems that are reliable, trustworthy, and well-aligned with human values and intentions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference datasets, which can lead to sub-optimal performance. On the other hand, recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation and suffers from distribution shift issues. To address this, we establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment by exploring responses and regulating preference labels. In doing so, we permit alignment methods to operate in an online and self-improving manner, as well as generalize prior online RLHF methods as special cases. Compared to state-of-the-art iterative RLHF methods, our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.

6/26/2024

cs.LG cs.AI cs.CL stat.ML

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Alfredo Garcia, Mingyi Hong

Aligning human preference and value is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data; 2) Preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning (RL) step to fine-tune the model. Such reward model serves as a proxy to human preference, and it is critical to guide the RL step towards improving the model quality. In this work, we argue that the SFT stage significantly benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse Reinforcement Learning (IRL) technique to (explicitly or implicitly) build an reward model, while learning the policy model. This approach leads to new SFT algorithms that are not only efficient to implement, but also promote the ability to distinguish between the preferred and non-preferred continuations. Moreover, we identify a connection between the proposed IRL based approach, and certain self-play approach proposed recently, and showed that self-play is a special case of modeling a reward-learning agent. Theoretically, we show that the proposed algorithms converge to the stationary solutions of the IRL problem. Empirically, we align 1B and 7B models using proposed methods and evaluate them on a reward benchmark model and the HuggingFace Open LLM Leaderboard. The proposed methods show significant performance improvement over existing SFT approaches. Our results indicate that it is beneficial to explicitly or implicitly leverage reward learning throughout the entire alignment process.

5/30/2024

cs.AI

Learn Your Reference Model for Real Good Alignment

Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov

The complexity of the alignment problem stems from the fact that existing methods are considered unstable. Reinforcement Learning from Human Feedback (RLHF) addresses this issue by minimizing the KL divergence between the trained policy and the initial supervised fine-tuned policy (SFT) to avoid generating out-of-domain samples for the reward model (RM). Recently, many methods have emerged that shift from online to offline optimization, reformulating the RLHF objective and removing the reward model (DPO, IPO, KTO). Despite eliminating the reward model and the challenges it posed, these algorithms are still constrained in terms of closeness of the trained policy to the SFT one. In our paper, we argue that this implicit limitation in the offline optimization methods leads to suboptimal results. To address this issue, we propose a class of new methods called Trust Region (TR-DPO, TR-IPO, TR-KTO), which update the reference policy during training. With this straightforward update approach, we demonstrate the effectiveness of the new paradigm of language model alignment against the classical one on the Anthropic-HH and Reddit TL;DR datasets. Most notably, when automatically comparing TR methods and baselines side by side using pretrained Pythia 6.9B models on the Reddit TL;DR task, the difference in win rates reaches 8.4% for DPO, 14.3% for IPO, and 15% for KTO. Finally, by assessing model response ratings grounded on criteria such as coherence, correctness, helpfulness, and harmlessness, we demonstrate that our proposed methods significantly outperform existing techniques.

5/22/2024

cs.LG cs.CL

Aligning Large Language Models via Fine-grained Supervision

Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak, Jaeyoung Do

Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of $5.1%$ in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.

6/6/2024

cs.CL cs.AI cs.LG