Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models

arXiv:2406.10305

Published 6/18/2024 by Jie Chen, Xintian Han, Yu Ma, Xun Zhou, Liang Xiang

Abstract

Automatic code generation has been a longstanding research topic. With the advancement of general-purpose large language models (LLMs), the ability to code stands out as one important measure of a model's reasoning performance. Usually, a two-stage training paradigm is implemented to obtain a Code LLM, namely pretraining and fine-tuning. Within fine-tuning, supervised fine-tuning (SFT) and reinforcement learning (RL) are often used to improve the model's zero-shot ability. A large body of work has been conducted to improve model performance on code-related benchmarks with either modifications to the algorithm or refinement of the dataset. However, we still lack deep insight into the correlation between SFT and RL. For instance, what kind of dataset should be used to ensure generalization, and what happens if we abandon the SFT phase in fine-tuning? In this work, we make an attempt to understand the correlation between SFT and RL. To facilitate our research, we manually craft 100 basis Python functions, called atomic functions, and then deploy a synthesizing pipeline to create a large number of synthetic functions on top of the atomic ones. In this manner, we ensure that the train and test sets remain distinct, preventing data contamination. Through a comprehensive ablation study, we find: (1) Both atomic and synthetic functions are indispensable for SFT's generalization, and only a handful of synthetic functions are adequate; (2) Through RL, the SFT's generalization to the target domain can be greatly enhanced, even with the same training prompts; (3) Training RL from scratch can alleviate the over-fitting issue introduced in the SFT phase.

Overview

Automatic code generation is a longstanding research topic, and the ability to code is an important measure of large language models' (LLMs) reasoning performance.
Typically, a two-stage training process is used to obtain a Code LLM: pretraining and fine-tuning, which often involves supervised fine-tuning (SFT) and reinforcement learning (RL).
Researchers have worked to improve model performance on code-related benchmarks, but the correlation between SFT and RL is not well understood.

Plain English Explanation

Automatically generating code has been a focus of research for a long time. As large language models have become more advanced, the ability to write code has become an important way to measure how well they can reason and solve problems.

To train these models to write code, a two-step process is usually used. First, the model is pretrained on a large amount of general text data. Then, it goes through a fine-tuning process, where it is trained more specifically on code-related tasks. This fine-tuning often involves both supervised fine-tuning (SFT), where the model is given examples to learn from, and reinforcement learning (RL), where the model learns by trial and error and receiving feedback.

Researchers have made a lot of progress in improving these models' performance on benchmarks (standardized tests) related to coding. However, they still don't fully understand the relationship between the SFT and RL parts of the fine-tuning process. For example, what kind of training data works best, and does the SFT phase even need to be included?

Technical Explanation

To investigate the correlation between SFT and RL, the researchers manually created 100 "atomic" Python functions and then used a synthesizing pipeline to generate a large number of more complex "synthetic" functions built on top of the atomic ones. Because every function is constructed this way, the training and test sets can be kept distinct, which prevents data contamination (test problems leaking into the training data).
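The idea of building synthetic functions from atomic ones and splitting them without overlap can be illustrated with a minimal sketch. This summary does not describe the paper's actual atomic functions or its synthesizing pipeline, so everything below is an assumption made for illustration: the four toy atomic functions, the choice to synthesize by chaining two atomic functions, and the names `ATOMIC`, `synthesize`, and `build_splits` are all hypothetical.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical atomic functions (the paper's real set of 100 is not listed here).
def reverse_list(xs: List[int]) -> List[int]:
    return xs[::-1]

def drop_negatives(xs: List[int]) -> List[int]:
    return [x for x in xs if x >= 0]

def double_each(xs: List[int]) -> List[int]:
    return [2 * x for x in xs]

def sort_ascending(xs: List[int]) -> List[int]:
    return sorted(xs)

ATOMIC: Dict[str, Callable[[List[int]], List[int]]] = {
    f.__name__: f
    for f in (reverse_list, drop_negatives, double_each, sort_ascending)
}

def synthesize(names: Tuple[str, ...]) -> Callable[[List[int]], List[int]]:
    """Assumed synthesis rule: apply the named atomic functions in order."""
    def composed(xs: List[int]) -> List[int]:
        for name in names:
            xs = ATOMIC[name](xs)
        return xs
    composed.__name__ = "_then_".join(names)
    return composed

def build_splits(depth: int = 2, test_fraction: float = 0.25, seed: int = 0):
    """Enumerate ordered compositions and split them into disjoint train/test sets."""
    combos = list(itertools.permutations(sorted(ATOMIC), depth))
    random.Random(seed).shuffle(combos)
    n_test = max(1, int(len(combos) * test_fraction))
    # Slicing one shuffled list guarantees no composition appears in both sets.
    return combos[n_test:], combos[:n_test]

train, test = build_splits()
example = synthesize(("drop_negatives", "double_each"))
print(example.__name__, example([3, -1, 2]))
print(len(train), "train compositions,", len(test), "test compositions")
```

Because both splits are slices of a single shuffled list of compositions, no synthetic function can appear in both, which is the property the contamination argument relies on.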

Through extensive experiments and analysis, the researchers found:

Atomic and synthetic functions are both indispensable for SFT's generalization: Training on both the simple atomic functions and the more complex synthetic ones is needed for the model to generalize, and only a handful of synthetic functions are adequate.
RL can greatly enhance SFT's generalization: Even when using the same training prompts as SFT, RL improves the model's ability to apply what it learned to the target domain.
Training RL from scratch can reduce overfitting: Starting the RL process without the SFT phase can help the model avoid getting too specialized on the training data.
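To make the difference between the two signals concrete: SFT needs a reference solution for each prompt, while RL only needs a way to score whatever the model generates, so the same prompts can be reused with a different kind of feedback. The sketch below shows one common way to score generated code, by running it against test cases. The summary does not say how the paper computes its reward, so the pass-rate reward and the name `unit_test_reward` are assumptions for illustration only.

```python
from typing import Any, List, Tuple

def unit_test_reward(
    candidate_source: str,
    func_name: str,
    cases: List[Tuple[tuple, Any]],
) -> float:
    """Return the fraction of test cases the generated code passes (0.0 to 1.0).

    Assumed reward design, not the paper's. exec() runs the code without any
    sandboxing, so a real training setup would need proper isolation.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        # Code that fails to compile or does not define the function earns nothing.
        return 0.0

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crash on one case counts as a failure for that case only.
            pass
    return passed / len(cases) if cases else 0.0

cases = [(([1, 2],), [2, 4]), (([],), [])]
good = "def double_each(xs):\n    return [2 * x for x in xs]\n"
bad = "def double_each(xs):\n    return xs\n"
broken = "def double_each(xs) return"

print(unit_test_reward(good, "double_each", cases))
print(unit_test_reward(bad, "double_each", cases))
print(unit_test_reward(broken, "double_each", cases))
```

A reward of this kind depends only on the behavior of the generated function, not on matching a reference solution token by token.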

Critical Analysis

The researchers acknowledge that their dataset of manually created functions, while carefully constructed, may not fully capture the complexity and variety of real-world coding tasks. Expanding the dataset or using more realistic programming problems could be an area for future research.

Additionally, the paper does not explore the potential trade-offs or practical considerations of omitting the SFT phase entirely in favor of starting with RL. There may be efficiency or stability concerns with this approach that warrant further investigation.

Overall, this study provides valuable insights into the interplay between SFT and RL in training code-generation models. The open questions are the two noted above: how well the findings transfer to more realistic programming problems, and what omitting the SFT phase costs in efficiency or stability.

Conclusion

This research sheds light on the relationship between supervised fine-tuning (SFT) and reinforcement learning (RL) in training large language models to generate code. The key findings suggest that a balance of simple and complex training examples, as well as the strategic use of RL, can enhance a model's ability to write code in new, unseen contexts.

These insights could inform the development of more robust and versatile code-generation models, which could have significant implications for software engineering, automation, and the broader field of artificial intelligence. As the research in this area continues to evolve, it will be important to carefully consider the tradeoffs and limitations to ensure these models are aligned with human values and priorities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
