Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

2402.04833

Published 6/5/2024 by Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion


Abstract

There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what are they? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with longest responses -- that intuitively contain more learnable information and are harder to overfit -- from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the Open LLM benchmarks that test factual knowledge. We demonstrate this for several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses. Overall, our findings suggest that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning. We provide our code at https://github.com/tml-epfl/long-is-more-for-alignment.


Overview

The paper explores a simple yet effective method for selecting high-quality instruction examples for fine-tuning large language models (LLMs).
It compares this method to more sophisticated approaches like LIMA and AlpaGasus, and shows it can outperform them.
The authors demonstrate the effectiveness of their approach on several LLMs and datasets, and provide an analysis to ensure the results are not due to biases in the evaluation.

Plain English Explanation

When fine-tuning large language models to follow instructions, it's important to have high-quality examples to train them on. The authors of this paper found that a simple approach of selecting the 1,000 instructions with the longest responses can outperform more complex methods for curating this data.

The intuition is that longer responses likely contain more information for the model to learn from, and are harder for the model to overfit to. The authors show this baseline consistently performs better, according to GPT-4 and PaLM-2 as judges, than sophisticated techniques like LIMA and AlpaGasus, which select high-quality examples via manual curation or GPT-3.5-Turbo quality scoring, respectively.

Importantly, the authors demonstrate this on multiple language models (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k), indicating the findings are robust. They also show that a lightweight refinement of the long instructions can further improve performance, allowing them to achieve competitive results on benchmarks like MT-Bench and AlpacaEval 2.0 while training on just 1,000 examples.

The key takeaway is that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning of large language models. This simple approach can outperform more complex methods, while requiring less effort and data.

Technical Explanation

The paper explores the challenge of selecting high-quality instruction examples for fine-tuning large language models (LLMs) to perform well on instruction-following tasks. The authors compare their proposed approach to two state-of-the-art methods, LIMA and AlpaGasus.

The key idea behind the authors' approach is to select the 1,000 instructions with the longest responses from standard datasets. The intuition is that longer responses likely contain more learnable information and are harder for the model to overfit to.
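The selection step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `select_longest`, the `"output"` field name, and the use of whitespace-separated word counts as the length measure are all assumptions made for the example (a tokenizer-based count could be swapped in the same way).

```python
from typing import Dict, List


def select_longest(examples: List[Dict[str, str]], k: int = 1000,
                   response_key: str = "output") -> List[Dict[str, str]]:
    """Return the k examples whose responses are longest.

    Length is measured in whitespace-separated words, which is an
    assumption of this sketch rather than the paper's stated measure.
    """
    def response_length(example: Dict[str, str]) -> int:
        return len(example[response_key].split())

    # sorted() is stable, so examples with equal length keep dataset order.
    return sorted(examples, key=response_length, reverse=True)[:k]


# Toy data standing in for an instruction dataset (hypothetical records).
toy = [
    {"instruction": "Say hi.", "output": "Hi."},
    {"instruction": "Explain overfitting.",
     "output": "Overfitting happens when a model memorizes its training data."},
    {"instruction": "Name a color.", "output": "Blue is one."},
]
subset = select_longest(toy, k=2)
```

With a real dataset, `k` would stay at its default of 1,000 and the resulting subset would be used as the fine-tuning set.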

The authors evaluate this simple baseline approach on several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k), and find that it consistently outperforms the more sophisticated LIMA and AlpaGasus methods, as judged by GPT-4 and PaLM-2.
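Pairwise judgments from an LLM judge are often summarized as a win rate. The sketch below shows one common convention (a tie counts as half a win); the summary does not state which aggregation the paper uses, so the function name `win_rate`, the verdict labels, and the tie handling are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable


def win_rate(verdicts: Iterable[str]) -> float:
    """Fraction of comparisons won, counting a tie as half a win.

    Each verdict is "win", "tie", or "lose" from the point of view of
    the model being evaluated (labels are hypothetical).
    """
    counts = Counter(verdicts)
    total = counts["win"] + counts["tie"] + counts["lose"]
    if total == 0:
        raise ValueError("no verdicts given")
    return (counts["win"] + 0.5 * counts["tie"]) / total


# Four hypothetical judge verdicts: two wins, one tie, one loss.
rate = win_rate(["win", "win", "tie", "lose"])
```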

Furthermore, the authors demonstrate that a lightweight refinement of the long instructions can further improve the abilities of the fine-tuned LLMs. With it, they obtain competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training on only 1,000 examples and no extra preference data.

To ensure the enhanced performance is not simply due to GPT-4's preference for longer responses, the authors conduct a thorough analysis of their models.
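One simple way to probe for a length preference in the judge, offered here only as an illustrative check and not as the authors' analysis, is to recompute the win rate on the subset of comparisons where the evaluated model's response is no longer than the competitor's. The function name, the record layout, and the word-count length measure below are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

# (response of model A, response of model B, verdict for A)
Record = Tuple[str, str, str]


def win_rate_when_not_longer(records: List[Record]) -> Optional[float]:
    """Win rate of model A over pairs where A's response is not longer than B's.

    Ties count as half a win. Returns None if no pair qualifies.
    """
    kept = [verdict for a, b, verdict in records
            if len(a.split()) <= len(b.split())]
    if not kept:
        return None
    wins = sum(1.0 for v in kept if v == "win")
    ties = sum(0.5 for v in kept if v == "tie")
    return (wins + ties) / len(kept)


# Hypothetical records; the second pair is dropped because A is longer.
records = [
    ("short", "a much longer answer", "win"),
    ("a b c d", "x", "win"),
    ("one two", "one two three", "lose"),
    ("a b", "c d", "tie"),
]
controlled = win_rate_when_not_longer(records)
```

If the model still wins often when its answers are not the longer ones, response length alone is a less likely explanation for the judge's preference.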

Critical Analysis

The paper presents a compelling and practical approach to instruction fine-tuning of LLMs, which appears to outperform more complex methods. However, it's worth considering a few potential limitations and areas for further research:

Generalization to other datasets and tasks: While the authors demonstrate the effectiveness of their approach on several datasets, it would be valuable to see how it performs on a wider range of instruction-following tasks, including those that may require more nuanced understanding or reasoning.
Scalability and efficiency: The authors note that their lightweight refinement of the long instructions can improve performance, but it's unclear how scalable or efficient this process is compared to the more sophisticated methods. Further investigation into the tradeoffs between performance and computational/data requirements would be helpful.
Interpretability and explainability: Beyond the intuition that longer responses contain more learnable information and are harder to overfit to, the paper as summarized here offers little insight into why selecting by response length performs so well. Exploring the underlying mechanisms and factors that contribute to the improved performance could lead to a better understanding of instruction fine-tuning in general.
Potential biases: Although the authors analyze their models to check that the gains are not simply due to GPT-4's preference for longer responses, other biases or limitations in LLM-based evaluation may remain. Exploring their potential impact on the findings would be valuable.

Overall, the authors' willingness to test complex selection methods against a simple baseline is commendable. Further research exploring the generalization, scalability, and interpretability of this approach could yield valuable insights for the broader field of instruction-following LLMs.

Conclusion

This paper introduces a simple yet effective method for selecting high-quality instruction examples to fine-tune large language models (LLMs) for instruction-following tasks. The authors show that a baseline approach of selecting the 1,000 instructions with the longest responses can outperform more sophisticated techniques like LIMA and AlpaGasus, as judged by powerful LLMs like GPT-4 and PaLM-2.

The findings are demonstrated across multiple LLMs and datasets, and the authors also show that a lightweight refinement of the long instructions can further improve performance, allowing them to achieve competitive results on benchmarks like MT-Bench and AlpacaEval 2.0 while training on just 1,000 examples.

These results suggest that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning of large language models. This simple approach can outperform more complex methods, while requiring less effort and data. The insights from this research could have significant implications for the development of more capable and efficient instruction-following AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Wenhu Chen, Ge Zhang

The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, as most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, these benchmarks can partially represent the reasoning performance of LLMs from large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built based on the existing instruction datasets. Specifically, in our LongIns, we introduce three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations on existing LLMs and have the following important findings: (1). The top-performing GPT-4 with 128k context length performs poorly on the evaluation context window of 16k in our LongIns. (2). For the multi-hop reasoning ability of many existing LLMs, significant efforts are still needed under short context windows (less than 4k).

6/27/2024

cs.CL


BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

Hieu Tran, Zhichao Yang, Zonghai Yao, Hong Yu

To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created BioInstruct, comprising 25,005 instructions, to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether categories (e.g., QA, IE, and generation) of instructions impact model performance. Compared with LLMs without instruction tuning, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in Generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting the synergies between two tasks. The BioInstruct dataset serves as a valuable resource and instruction-tuned LLMs lead to the best performing BioNLP applications.

6/10/2024

cs.CL cs.AI


Instruction Tuning With Loss Over Instructions

Zhengyan Shi, Adam X. Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, Aldo Lipani

Instruction tuning plays a crucial role in shaping the outputs of language models (LMs) to desired styles. In this work, we propose a simple yet effective method, Instruction Modelling (IM), which trains LMs by applying a loss function to the instruction and prompt part rather than solely to the output part. Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can effectively improve the LM performance on both NLP tasks (e.g., MMLU, TruthfulQA, and HumanEval) and open-ended generation benchmarks (e.g., MT-Bench and AlpacaEval). Remarkably, in the most advantageous case, IM boosts model performance on AlpacaEval 1.0 by over 100%. We identify two key factors influencing the effectiveness of IM: (1) The ratio between instruction length and output length in the training data; and (2) The number of training examples. We observe that IM is especially beneficial when trained on datasets with lengthy instructions paired with brief outputs, or under the Superficial Alignment Hypothesis (SAH) where a small amount of training examples are used for instruction tuning. Further analysis substantiates our hypothesis that the improvement can be attributed to reduced overfitting to instruction tuning datasets. Our work provides practical guidance for instruction tuning LMs, especially in low-resource scenarios.

5/24/2024

cs.CL cs.AI

Phased Instruction Fine-Tuning for Large Language Models

Wei Pang, Chuan Zhou, Xiao-Hua Zhou, Xiaojie Wang

Instruction Fine-Tuning enhances pre-trained language models from basic next-word prediction to complex instruction-following. However, existing One-off Instruction Fine-Tuning (One-off IFT) method, applied on a diverse instruction, may not effectively boost models' adherence to instructions due to the simultaneous handling of varying instruction complexities. To improve this, Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process. It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets. Experiments with Llama-2 7B/13B/70B, Llama3 8/70B and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT, supporting the progressive alignment hypothesis and providing a simple and efficient way to enhance large language models. Codes and datasets from our experiments are freely available at https://github.com/xubuvd/PhasedSFT.

6/18/2024

cs.CL cs.AI