Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Read original: arXiv:2406.10216 - Published 6/17/2024 by Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang

Overview

This paper proposes improving how well reward models for large language models (LLMs) generalize by regularizing their hidden states.
The authors target the limited ability of current reward models to generalize to unseen prompts and responses, which can lead to reward over-optimization: pushing the learned reward ever higher ends up lowering actual performance.
Their method retains the base model's language model head and adds text-generation losses so that the hidden states keep their text-generation capabilities, while a reward head is trained on the same hidden states.

Plain English Explanation

The researchers in this study are addressing a problem with reward models, the components that score how "good" an LLM's response is so that the LLM can be trained toward human preferences. A reward model trained on one set of human preference data often judges unfamiliar prompts and responses poorly, so the scores it learned do not carry over well to new situations.

The key idea behind this work is to regularize the hidden states of the reward model during training. Hidden states are the internal representations the model computes for a piece of text. Here the regularization takes the form of extra text-generation objectives: while learning to score responses, the model must also remain able to generate text from those same hidden states. Keeping that general-purpose ability is meant to stop the reward model from latching onto quirks of its training data, so that it scores unfamiliar prompts and responses more reliably.

The authors report that reward models trained with this regularization are more accurate on out-of-distribution tasks, and that the approach alleviates reward over-optimization, where optimizing hard against an imperfect reward model raises the reward score while actual quality declines.

Technical Explanation

The key technical contribution of this paper is a regularization method for learning a generalizable reward model for LLMs. The authors start from the observation that reward models trained on human preference data generalize poorly to unseen prompts and responses, and that this can cause reward over-optimization during reinforcement learning from human feedback (RLHF).
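For context, reward models in RLHF are commonly fit to pairwise human preferences with a Bradley-Terry style loss. The form below is the standard one and is given as an assumption about the setup, not quoted from the paper; the symbols $r_\theta$ (reward model), $x$ (prompt), $y_c$ and $y_r$ (chosen and rejected responses), and $D$ (preference dataset) are notation introduced here for illustration.

```latex
\mathcal{L}_{\text{reward}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_c,\, y_r) \sim D}
    \left[ \log \sigma\big( r_\theta(x, y_c) - r_\theta(x, y_r) \big) \right]
```

This objective only asks the model to rank the pairs in $D$ correctly. It places no constraint on how the model scores prompts and responses unlike those in $D$, which is where generalization problems can arise.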

To address this, the authors regularize the reward model's hidden states, an approach this summary refers to as Hidden State Regularization (HSR). Rather than discarding the base model's language model head, they retain it and add a suite of text-generation losses to the training objective, so that the hidden states preserve their text-generation capabilities while a reward head is learned on the same hidden states. The aim is to improve the reward model's generalization under distribution shift.
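According to the paper's abstract, the regularization keeps the base model's language model head and adds text-generation losses on the same hidden states that feed the reward head. The sketch below shows one way such a combined objective could look. It is an illustration under stated assumptions, not the authors' implementation: the pairwise (Bradley-Terry) reward loss, the choice of a next-token loss on the chosen response as the text-generation term, the `(1 - alpha)` / `alpha` weighting, the default `alpha`, and every function name are hypothetical. Hidden states are passed in as plain lists so the example runs with only the standard library; in practice they would come from a shared transformer backbone.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dot(weights, hidden):
    return sum(w * h for w, h in zip(weights, hidden))

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def reward(reward_head, hidden):
    # Scalar reward read from a single hidden state via a linear head.
    return dot(reward_head, hidden)

def pairwise_reward_loss(r_chosen, r_rejected):
    # Assumed Bradley-Terry loss: low when the chosen response scores higher.
    return -log_sigmoid(r_chosen - r_rejected)

def text_generation_loss(lm_head, hidden_states, target_tokens):
    # Mean negative log-likelihood under the retained LM head.
    # hidden_states[t] is the state used to predict target_tokens[t].
    total = 0.0
    for hidden, token in zip(hidden_states, target_tokens):
        logits = [dot(row, hidden) for row in lm_head]
        total -= log_softmax(logits)[token]
    return total / len(target_tokens)

def combined_loss(reward_head, lm_head, chosen_hidden, rejected_hidden,
                  chosen_tokens, alpha=0.1):
    # Both heads read the SAME hidden states; alpha is an arbitrary
    # illustrative weight on the text-generation regularizer.
    r_c = reward(reward_head, chosen_hidden[-1])
    r_r = reward(reward_head, rejected_hidden[-1])
    l_reward = pairwise_reward_loss(r_c, r_r)
    l_reg = text_generation_loss(lm_head, chosen_hidden, chosen_tokens)
    return (1 - alpha) * l_reward + alpha * l_reg
```

Because gradients from both terms would flow into the shared backbone, the text-generation term pulls the hidden states toward remaining useful for generation, while the reward term shapes them for preference ranking.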

The authors report that HSR markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks, meaning prompts and responses that differ from the preference data used for training. They also report that it alleviates the over-optimization issue in RLHF compared with reward models trained without the regularization.
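Reward model accuracy on preference data is usually measured as the fraction of pairs in which the human-preferred response receives the higher score. The sketch below shows that metric applied to an in-distribution set and a held-out set. The function names, the toy length-based reward, and the example pairs are all hypothetical and chosen only so that the code runs; they are not the paper's benchmarks or results.

```python
def pairwise_accuracy(reward_fn, preference_pairs):
    """Fraction of (prompt, chosen, rejected) triples ranked correctly."""
    if not preference_pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1
        for prompt, chosen, rejected in preference_pairs
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(preference_pairs)

def toy_reward(prompt, response):
    # Stand-in reward model that simply prefers longer responses.
    return float(len(response))

# Hypothetical data: longer answers happen to be preferred in training-like
# pairs, but not in the held-out pairs.
in_distribution_pairs = [
    ("Explain recursion.", "A function that calls itself on smaller inputs.", "Loops."),
    ("What is a hash map?", "A key-value store with average O(1) lookup.", "A map."),
]
out_of_distribution_pairs = [
    ("Reply with one word: capital of France?", "Paris", "The capital of France is Paris, a city on the Seine."),
    ("Explain recursion.", "A function that calls itself on smaller inputs.", "Loops."),
]

in_dist_acc = pairwise_accuracy(toy_reward, in_distribution_pairs)
ood_acc = pairwise_accuracy(toy_reward, out_of_distribution_pairs)
print(f"in-distribution accuracy: {in_dist_acc:.2f}, OOD accuracy: {ood_acc:.2f}")
```

A gap between the two numbers is the kind of generalization failure the regularization is meant to reduce: the toy reward relies on a shortcut (length) that holds in one set but not the other.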

Critical Analysis

The key strength of this work is the idea of regularizing the reward model's hidden states to improve the generalizability of the learned reward model. It tackles limited reward model generalization, an important problem in RLHF, directly in the reward model rather than by constraining policy optimization, which the abstract identifies as the focus of previous research.

That said, the paper does not fully explore the limitations of this approach. For example, it's unclear how the choice of regularization hyperparameters affects the performance, or how the method would scale to larger and more complex LLMs. Additionally, the authors do not discuss potential negative societal impacts of deploying such a generalized reward model in the real world.

Further research is needed to better understand the robustness and safety considerations of this approach, as well as to explore its applicability to a wider range of RL tasks and environments. Nonetheless, this work represents a valuable contribution to the ongoing efforts to make RL with LLMs more efficient and reliable.

Conclusion

In this paper, the authors present a hidden state regularization method, referred to in this summary as Hidden State Regularization (HSR), for learning a generalizable reward model for large language models (LLMs). By retaining the base model's language model head and training with text-generation losses alongside the reward objective, they obtain a reward model that is more accurate on out-of-distribution tasks and less susceptible to over-optimization in RLHF.

The key insight of this work is that preserving the text-generation capabilities of the hidden states, while learning the reward head on those same hidden states, yields a reward model that generalizes better beyond its training distribution. This is a step toward a more reliable and robust preference learning paradigm for RLHF.

While further research is needed to fully understand the limitations and broader implications of this approach, this paper represents a significant contribution to the field of RL with LLMs and lays the groundwork for future advancements in this area.

Related Papers

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang

Reward models trained on human preference data have been proven to be effective for aligning Large Language Models (LLMs) with human intent within the reinforcement learning from human feedback (RLHF) framework. However, the generalization capabilities of current reward models to unseen prompts and responses are limited. This limitation can lead to an unexpected phenomenon known as reward over-optimization, where excessive optimization of rewards results in a decline in actual performance. While previous research has advocated for constraining policy optimization, our study proposes a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviates the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.

6/17/2024

Learning Goal-Conditioned Representations for Language Reward Models

Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx

Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear how improved representation learning can benefit reinforcement learning from human feedback (RLHF) on language models (LMs). In this work, we propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves RM performance by up to 0.09 AUROC across challenging benchmarks, such as MATH and GSM8k. These findings extend to general alignment as well -- on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model performance, we show this way of training RM representations enables improved steerability because it allows us to evaluate the likelihood of an action achieving a particular goal-state (e.g., whether a solution is correct or helpful). Leveraging this insight, we find that we can filter up to 55% of generated tokens during majority voting by discarding trajectories likely to end up in an incorrect state, which leads to significant cost savings. We additionally find that these representations can perform fine-grained control by conditioning on desired future goal-states. For example, we show that steering a Llama 3 model towards helpful generations with our approach improves helpfulness by 9.6% over a supervised-fine-tuning trained baseline. Similarly, steering the model towards complex generations improves complexity by 21.6% over the baseline. Overall, we find that training RMs in this contrastive, goal-conditioned fashion significantly improves performance and enables model steerability.

7/22/2024

Scalable Ensembling For Mitigating Reward Overoptimisation

Ahmed M. Ahmed, Rafael Rafailov, Stepan Sharkov, Xuechen Li, Sanmi Koyejo

Reinforcement Learning from Human Feedback (RLHF) has enabled significant advancements within language modeling for powerful, instruction-following models. However, the alignment of these models remains a pressing challenge as the policy tends to overfit the learned "proxy" reward model past an inflection point of utility as measured by a "gold" reward model that is more performant -- a phenomenon known as overoptimisation. Prior work has mitigated this issue by computing a pessimistic statistic over an ensemble of reward models, which is common in Offline Reinforcement Learning but incredibly costly for language models with high memory requirements, making such approaches infeasible for sufficiently large models. To this end, we propose using a shared encoder but separate linear heads. We find this leads to similar performance as the full ensemble while allowing tremendous savings in memory and time required for training for models of similar size.

6/21/2024

Generalizing Reward Modeling for Out-of-Distribution Preference Learning

Chen Jia

Preference learning (PL) with large language models (LLMs) aims to align the LLMs' generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging. Thus, out-of-distribution (OOD) PL is practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm is utilized to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for PL. We theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.

6/11/2024