Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

2406.13542

Published 6/21/2024 by Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou

Abstract

One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.

Overview

This research paper introduces AutoIF, a "self-play with execution feedback" method for improving the instruction-following capabilities of large language models (LLMs).
The key idea is to have the LLM generate instructions, code that checks whether a response follows each instruction, and unit test samples that check the code itself.
Responses that pass the executed checks are kept as training data, which is then used to fine-tune the LLM so that it follows instructions more reliably.

Plain English Explanation

The researchers investigated a way to make large language models (LLMs) better at following instructions. LLMs are powerful AI systems that can understand and generate human language, but they sometimes struggle with carrying out specific tasks or instructions.

The researchers' approach involves having the LLM produce its own practice material. First, the LLM writes some instructions. For each instruction, it also writes a small program that can check whether a response obeys the instruction, plus test cases that confirm the program itself works. The programs are then run on candidate responses. This pass-or-fail result is the execution feedback, and it decides which responses are kept as training data to fine-tune the LLM, so that it can better understand and follow instructions in the future.

The key insight is that by practicing instruction-following and getting feedback on its performance, the LLM can learn and get better over time. This is similar to how humans improve at tasks through practice and feedback. By using this self-play approach, the researchers were able to significantly boost the LLM's instruction-following capabilities.

Technical Explanation

The paper introduces AutoIF, a method built on "self-play with execution feedback" to improve the instruction-following abilities of large language models (LLMs). [1] The core idea is to turn the validation of instruction-following data into code verification: the LLM generates instructions, the code that checks whether a response to each instruction is correct, and unit test samples that verify the code's correctness.

Running the verification code on candidate responses provides the execution feedback. Rejection sampling based on this feedback selects the data used to fine-tune the LLM, enabling it to better understand and follow instructions over time. The researchers hypothesize that this self-play process, combined with the execution feedback, will lead to substantial improvements in the LLM's instruction-following performance.
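
The abstract describes unit test samples being used to verify the checking code. The sketch below illustrates one way that step could look. It is not the authors' implementation: the instruction, the two candidate checkers, the test samples, and every function name are hypothetical, and the rule of keeping a checker only if it passes every test sample is a simplifying assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical instruction the LLM might have generated.
instruction = "Answer in exactly three sentences."

# Hypothetical verification code, held as source text the way an LLM would emit it.
GOOD_CHECKER = '''
def evaluate(response):
    import re
    # Count runs of sentence-ending punctuation.
    return len(re.findall(r"[.!?]+", response)) == 3
'''

BUGGY_CHECKER = '''
def evaluate(response):
    # Bug: counts words instead of sentences.
    return len(response.split()) == 3
'''

# Hypothetical unit test samples: (response, expected verdict).
test_cases: List[Tuple[str, bool]] = [
    ("I like tea. I like coffee. I like both.", True),
    ("Only one sentence here.", False),
    ("First one. Second one. Third one. Fourth one.", False),
]


def compile_checker(source: str) -> Callable[[str], bool]:
    """Turn generated source text into a callable `evaluate` function."""
    namespace: dict = {}
    # exec is unsafe on untrusted code; a real pipeline would need a sandbox.
    exec(source, namespace)
    return namespace["evaluate"]


def passes_all_tests(checker: Callable[[str], bool],
                     cases: List[Tuple[str, bool]]) -> bool:
    """A checker is trusted only if it agrees with every test sample."""
    for response, expected in cases:
        try:
            if bool(checker(response)) != expected:
                return False
        except Exception:
            return False  # a crashing checker is discarded
    return True


def filter_checkers(sources: List[str],
                    cases: List[Tuple[str, bool]]) -> List[Callable[[str], bool]]:
    """Keep only the checkers that compile and pass the test samples."""
    kept = []
    for source in sources:
        try:
            checker = compile_checker(source)
        except Exception:
            continue  # source that does not compile is discarded
        if passes_all_tests(checker, cases):
            kept.append(checker)
    return kept


valid_checkers = filter_checkers([GOOD_CHECKER, BUGGY_CHECKER], test_cases)
print(len(valid_checkers))
```

In this toy run the word-counting checker disagrees with the first test sample and is dropped, so only the sentence-counting checker survives to judge responses.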

To implement this approach, the researchers apply AutoIF to two open-source model families, Qwen2 and LLaMA3, in two settings: self-alignment and strong-to-weak distillation. The data selected by execution feedback-based rejection sampling is used with three training algorithms: Supervised Fine-Tuning (SFT), Offline DPO, and Online DPO. [2]
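
The abstract states that execution feedback-based rejection sampling produces data for both SFT and preference-based training. A minimal sketch of how that selection could work is shown below. All names, the pass-rate threshold, and the example checkers are hypothetical, and pairing passing responses with failing ones as chosen/rejected examples is an assumption about how preference data might be built, not a confirmed detail of AutoIF.

```python
from typing import Callable, Dict, List, Tuple

Checker = Callable[[str], bool]


def pass_rate(response: str, checkers: List[Checker]) -> float:
    """Fraction of checkers that accept the response; a crash counts as a reject."""
    if not checkers:
        return 0.0
    passed = 0
    for check in checkers:
        try:
            passed += 1 if check(response) else 0
        except Exception:
            pass
    return passed / len(checkers)


def rejection_sample(
    instruction: str,
    responses: List[str],
    checkers: List[Checker],
    threshold: float = 1.0,  # hypothetical cutoff: require every check to pass
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split sampled responses by execution feedback into SFT and preference data."""
    accepted = [r for r in responses if pass_rate(r, checkers) >= threshold]
    rejected = [r for r in responses if pass_rate(r, checkers) < threshold]

    # SFT data: instruction paired with each response that passed.
    sft_data = [{"prompt": instruction, "response": r} for r in accepted]

    # Preference data (assumed scheme): each passing response beats each failing one.
    preference_pairs = [
        {"prompt": instruction, "chosen": good, "rejected": bad}
        for good in accepted
        for bad in rejected
    ]
    return sft_data, preference_pairs


# Hypothetical instruction, checkers, and sampled responses.
instruction = "Reply in lowercase using at most five words."
checkers: List[Checker] = [
    lambda r: r == r.lower(),       # lowercase only
    lambda r: len(r.split()) <= 5,  # at most five words
]
responses = [
    "the sky is blue",
    "The sky is blue",
    "the sky is a very deep shade of blue",
]

sft_data, preference_pairs = rejection_sample(instruction, responses, checkers)
print(len(sft_data), len(preference_pairs))
```

Here only the first response satisfies both checks, so it becomes the single SFT example and is paired against each of the two failing responses.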

The researchers report that AutoIF achieves significant improvements across all three training algorithms on both model families. They attribute these gains to the LLM's ability to learn from its own mistakes and continuously improve its instruction-following capabilities through the self-play process.

Critical Analysis

The paper presents a promising approach for enhancing the instruction-following capabilities of large language models. By having the LLM practice generating and executing instructions, and then receive direct feedback on its performance, the researchers are able to drive substantial improvements in a core language understanding task. [3]

However, the paper does not delve deeply into the specific mechanisms by which the self-play and feedback process leads to these performance gains. It would be valuable to have a more detailed analysis of the learning dynamics at play and the types of instruction-following skills the LLM is able to acquire through this approach.

Additionally, the paper focuses primarily on evaluating the LLM's performance on relatively narrow, constrained instruction-following tasks. It remains to be seen how well this technique would generalize to more open-ended, real-world instruction-following scenarios, where the language and context are more complex and ambiguous.

Further research is also needed to better understand the broader implications and potential limitations of this self-play approach. For example, it is unclear how the technique might scale to larger and more capable LLMs, or how it might interact with other fine-tuning or training strategies. [4]

Conclusion

Overall, this research represents an important step forward in enhancing the instruction-following capabilities of large language models. By leveraging a self-play process with execution feedback, the researchers are able to significantly boost the LLM's performance on a range of instruction-following tasks.

This work has the potential to unlock new applications and use cases for LLMs, particularly in domains that require clear and reliable task execution. As language models continue to grow in power and sophistication, techniques like self-play with execution feedback will likely play an increasingly important role in ensuring they can be safely and effectively deployed in real-world settings.

[1] https://aimodels.fyi/papers/arxiv/from-quantity-to-quality-boosting-llm-performance
[2] https://aimodels.fyi/papers/arxiv/policy-improvement-using-language-feedback-models
[3] https://aimodels.fyi/papers/arxiv/towards-robust-instruction-tuning-multimodal-large-language
[4] https://aimodels.fyi/papers/arxiv/phased-instruction-fine-tuning-large-language-models, https://aimodels.fyi/papers/arxiv/optimizing-testing-instruction-following-analyzing-impact-fine

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao

In the realm of Large Language Models (LLMs), the balance between instruction data quality and quantity is a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability. Through the application of IFD, cherry samples can be pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of original data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the instruction tuning of LLMs, promising both efficiency and resource-conscious advancements. Codes, data, and models are available: https://github.com/tianyi-lab/Cherry_LLM

4/9/2024

cs.CL

Policy Improvement using Language Feedback Models

Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté

We introduce Language Feedback Models (LFMs) that identify desirable behaviour - actions that help achieve tasks specified in the instruction - for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict actions, when controlling for the number of LLM output tokens. Third, LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation. Finally, LFM can be modified to provide human-interpretable feedback without performance loss, allowing human verification of desirable behaviour for imitation learning.

4/22/2024

cs.LG cs.AI cs.CL

Towards Robust Instruction Tuning on Multimodal Large Language Models

Wei Han, Hui Chen, Soujanya Poria

Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent works about high-quality instruction-following data generation and selection require amounts of human labor to conceive model-understandable instructions for the given tasks and carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks MULTIINSTRUCT and InstructBLIP show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, which is even equivalent to the benefits of scaling up training data multiple times.

6/17/2024

cs.CL cs.AI

Phased Instruction Fine-Tuning for Large Language Models

Wei Pang, Chuan Zhou, Xiao-Hua Zhou, Xiaojie Wang

Instruction Fine-Tuning enhances pre-trained language models from basic next-word prediction to complex instruction-following. However, existing One-off Instruction Fine-Tuning (One-off IFT) method, applied on a diverse instruction, may not effectively boost models' adherence to instructions due to the simultaneous handling of varying instruction complexities. To improve this, Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process. It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets. Experiments with Llama-2 7B/13B/70B, Llama3 8/70B and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT, supporting the progressive alignment hypothesis and providing a simple and efficient way to enhance large language models. Codes and datasets from our experiments are freely available at https://github.com/xubuvd/PhasedSFT.

6/18/2024

cs.CL cs.AI