Training Language Models to Self-Correct via Reinforcement Learning

Read original: arXiv:2409.12917 - Published 9/20/2024 by Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs and 8 others

Overview

Large language models (LLMs) are powerful AI systems that can generate human-like text, but they often struggle with self-correction.
Existing approaches to improve self-correction either require multiple models or rely on more capable models or additional supervision.
The researchers developed a new method called SCoRe that significantly improves an LLM's self-correction ability using only self-generated data.

Plain English Explanation

The research paper explores the challenge of getting large language models (LLMs) to effectively correct their own mistakes. LLMs are AI systems that can generate human-like text, but they often struggle to catch and fix their own errors.

Existing methods for improving self-correction either require having multiple models work together or rely on a more powerful model or other forms of external guidance to help with the corrections. In contrast, the researchers developed a new approach called SCoRe that can significantly boost an LLM's self-correction abilities using only the model's own self-generated data.

The key insight is that simply fine-tuning the model on its own correction traces (examples of the model correcting itself) is not enough. This can lead to the model only learning to correct in certain predictable ways, or to a mismatch between the training data and the model's real-world behavior.

To address these issues, SCoRe uses a two-stage reinforcement learning process. First, it runs an initial phase of reinforcement learning on the base model to produce a starting policy that is less prone to collapse. Then, during the main training phase, it adds a reward bonus that encourages the model to correct itself more effectively.

By using this approach, the researchers were able to significantly boost the self-correction performance of two different LLMs, Gemini 1.0 Pro and 1.5 Flash, on standard benchmarks like MATH and HumanEval.

Technical Explanation

The researchers first show that straightforward approaches like supervised fine-tuning (SFT) on model-generated correction traces are insufficient for instilling robust self-correction capabilities in LLMs. SFT either suffers from a distribution mismatch between the training data and the model's real outputs, or it implicitly leads the model to learn a narrow set of correction behaviors that may not generalize well.

To address these challenges, the researchers developed SCoRe, a multi-turn online reinforcement learning (RL) approach. The key elements of SCoRe are:

A first phase of RL on the base model that produces a policy initialization less susceptible to collapse.
A reward bonus during the main training phase that amplifies self-correction, encouraging the model to learn an effective self-correction strategy.
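
The reward bonus described above can be illustrated with a minimal sketch. This is not the paper's implementation: the names `TwoAttemptEpisode`, `shaped_rewards`, and `alpha`, and the specific bonus form (a multiple of the improvement from the first attempt to the second), are assumptions made for this example. The summary only states that a bonus is used to amplify self-correction.

```python
from dataclasses import dataclass


@dataclass
class TwoAttemptEpisode:
    """Correctness rewards for one prompt: the first answer, then the revision."""
    first_reward: float   # e.g. 1.0 if the first answer is correct, else 0.0
    second_reward: float  # the same scoring applied to the revised answer


def shaped_rewards(episode: TwoAttemptEpisode, alpha: float = 1.0) -> tuple[float, float]:
    """Return (first-attempt reward, second-attempt reward including the bonus).

    Assumed bonus form: alpha * (second_reward - first_reward).
    It is positive when the revision fixes a wrong answer, negative when it
    breaks a correct one, and zero when correctness does not change.
    """
    bonus = alpha * (episode.second_reward - episode.first_reward)
    return episode.first_reward, episode.second_reward + bonus


if __name__ == "__main__":
    cases = {
        "wrong -> right": TwoAttemptEpisode(0.0, 1.0),
        "right -> right": TwoAttemptEpisode(1.0, 1.0),
        "right -> wrong": TwoAttemptEpisode(1.0, 0.0),
        "wrong -> wrong": TwoAttemptEpisode(0.0, 0.0),
    }
    for name, episode in cases.items():
        print(name, shaped_rewards(episode, alpha=2.0))
```

Under this assumed form, a policy that repeats its first answer unchanged earns no bonus, while a genuine fix is rewarded more than an answer that was already correct. That is one way a bonus could steer training toward actual correction rather than simply producing a good first answer.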

When applied to the Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieved state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% on the MATH benchmark and 9.1% on HumanEval.
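
This summary does not say how "self-correction" is scored. One common way to quantify it, assumed here for illustration, is to compare accuracy on the first attempt with accuracy after the revision, and to track how often a revision fixes a wrong answer versus breaks a correct one. The function name `self_correction_report` and the metric names are hypothetical, and whether the reported figures are computed exactly this way is not stated here.

```python
def self_correction_report(
    first_correct: list[bool], second_correct: list[bool]
) -> dict[str, float]:
    """Summarize two-attempt results over the same set of problems.

    first_correct[i] / second_correct[i] say whether problem i was answered
    correctly on the first attempt and after the revision, respectively.
    """
    if not first_correct or len(first_correct) != len(second_correct):
        raise ValueError("need two non-empty result lists of equal length")

    n = len(first_correct)
    pairs = list(zip(first_correct, second_correct))
    acc1 = sum(first_correct) / n
    acc2 = sum(second_correct) / n
    return {
        "accuracy_attempt_1": acc1,
        "accuracy_attempt_2": acc2,
        # Net gain from revising: positive means self-correction helped overall.
        "delta": acc2 - acc1,
        # Fraction of problems that went from wrong to right.
        "fixed": sum((not a) and b for a, b in pairs) / n,
        # Fraction of problems that went from right to wrong.
        "broken": sum(a and (not b) for a, b in pairs) / n,
    }


if __name__ == "__main__":
    first = [True, False, False, True]
    second = [True, True, False, False]
    print(self_correction_report(first, second))
```

Reporting the "fixed" and "broken" fractions alongside the net delta is useful because a delta near zero can hide a model that repairs some answers while damaging just as many.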

Critical Analysis

The paper provides a thoughtful analysis of the limitations of existing approaches and carefully designs the SCoRe method to address these challenges. However, the researchers acknowledge that SCoRe still has some room for improvement.

For example, the paper mentions that SCoRe's performance can be sensitive to the choice of hyperparameters and reward function. This suggests that further research may be needed to make SCoRe more robust and easier to tune.

Additionally, while SCoRe shows impressive gains on the specific benchmarks tested, it would be valuable to see how it performs on a wider range of tasks and in more real-world scenarios. Exploring the model's self-correction abilities in open-ended conversational settings could provide additional insights.

Conclusion

This research represents an important step forward in improving the self-correction capabilities of large language models. By developing the SCoRe method, the researchers have shown that it is possible to significantly boost an LLM's self-correction abilities using only self-generated data, without relying on external supervision or more capable models.

The insights and techniques presented in this paper could have far-reaching implications for making LLMs more robust, reliable, and trustworthy, which is critical as these models become increasingly integrated into real-world applications and decision-making processes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

9/20/2024

A Theoretical Understanding of Self-Correction through In-context Alignment

Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.

5/30/2024

Large Language Models Can Self-Correct with Minimal Effort

Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang

Intrinsic self-correct was a method that instructed large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, the study concluded that the LLMs could not self-correct reasoning yet. We find that a simple yet effective verification method can unleash inherent capabilities of the LLMs. That is to mask a key condition in the question, add the current response to construct a verification question, and predict the condition to verify the response. The condition can be an entity in an open-domain question or a numeric value in a math question, which requires minimal effort (via prompting) to identify. We propose an iterative verify-then-correct framework to progressively identify and correct (probably) false responses, named ProCo. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo as the backend LLM, yields $+6.8$ exact match on four open-domain question answering datasets, $+14.1$ accuracy on three arithmetic reasoning datasets, and $+9.6$ accuracy on a commonsense reasoning dataset, compared to Self-Correct.

6/26/2024

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether small (<= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

6/7/2024