LESS: Selecting Influential Data for Targeted Instruction Tuning

2402.04333

Published 6/14/2024 by Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen

LESS: Selecting Influential Data for Targeted Instruction Tuning

Abstract

Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop generalpurpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.

Create account to get full access

Overview

This paper introduces LESS, a technique for selecting influential data samples from a training dataset to focus instruction tuning on the most important examples.
The key ideas are to measure the per-step influence of each data sample on the model's training and use this to identify the most important samples for targeted tuning.
The paper demonstrates that LESS can improve model performance compared to standard fine-tuning approaches, especially for low-resource settings.

Plain English Explanation

When training machine learning models, the training data used can have a big impact on the final model performance. LESS: Selecting Influential Data for Targeted Instruction Tuning introduces a way to identify the most influential data samples in the training set and focus the model training on just those samples.

The core idea is to measure how much each individual training example influences the model as it is being trained. Some examples have a bigger impact on the final model than others. By identifying and focusing the training on just the most influential examples, the model can learn more efficiently, especially when there is limited training data available.

This targeted training approach, called LESS, is shown to outperform standard fine-tuning techniques in several experiments. It can be particularly helpful for low-resource settings where there is not a lot of training data available, as it allows the model to focus on learning from the most important examples.

Technical Explanation

The key concept behind LESS is per-step influence, which measures how much each individual training example contributes to the gradients computed during model updates. The paper formalizes this influence metric and shows how it can be efficiently computed.

By ranking the training examples by their per-step influence, the authors identify the most impactful samples. They then fine-tune the model by focusing the training on just these influential examples, rather than the full training set.

The paper demonstrates the effectiveness of this LESS approach through experiments on language modeling and machine translation tasks. Compared to standard fine-tuning, LESS is shown to improve model performance, especially in low-resource settings where there is limited training data available.

Critical Analysis

The LESS technique introduces a novel and promising approach for improving model performance through targeted training on influential data samples. However, the paper does not fully explore the limitations and potential drawbacks of this method.

For example, the paper does not investigate how the LESS approach scales to extremely large datasets or how it performs on more diverse task domains beyond language modeling and translation. There are also open questions about the computational overhead of calculating per-step influence and whether this cost can be reduced.

Additionally, the paper does not delve into potential biases or fairness issues that could arise from focusing model training on the most influential examples, which may not be representative of the full data distribution.

Further research is needed to better understand the strengths, weaknesses, and broader applicability of the LESS technique. Incorporating selective reflection tuning or G-DiG approaches could also help improve the diversity and robustness of the selected influential samples.

Conclusion

The LESS technique introduced in this paper presents a novel and promising approach for improving model performance by identifying and focusing training on the most influential data samples. By measuring the per-step influence of each training example, LESS can selectively fine-tune models on the most impactful examples, leading to better results, especially in low-resource settings.

While the paper demonstrates the effectiveness of this approach, further research is needed to fully understand its limitations and broader applicability. Exploring how LESS scales to larger datasets, its performance on diverse task domains, and potential biases or fairness issues will be important next steps. Combining LESS with other techniques like selective reflection tuning or G-DiG could also help unlock its full potential.

Overall, the LESS method represents an exciting advancement in the field of targeted and efficient model training, with the potential to significantly improve the performance and robustness of machine learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Ziche Liu, Rui Ke, Feng Jiang, Haizhou Li

Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison between existing methods due to their various experimental settings. To address this issue, we first propose a three-stage scheme for data selection and comprehensively review existing works according to this scheme. Then, we design a unified comparing method with ratio-based efficiency indicators and ranking-based feasibility indicators to overcome the difficulty of comparing various models with diverse experimental settings. After an in-depth comparative analysis, we find that the more targeted method with data-specific and model-specific quality labels has higher efficiency, but the introduction of additional noise information should be avoided when designing selection algorithms. Finally, we summarize the trends in data selection and highlight the short-term and long-term challenges to guide future research.

6/21/2024

cs.CL

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Tianyi Zhou

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach.

6/11/2024

cs.CL

Contrastive Instruction Tuning

Tianyi Lorena Yan, Fei Wang, James Y. Huang, Wenxuan Zhou, Fan Yin, Aram Galstyan, Wenpeng Yin, Muhao Chen

Instruction tuning has been used as a promising approach to improve the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. This behavior indicates LLMs' lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning, which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy. Code is available at https://github.com/luka-group/CoIN.

6/7/2024

cs.CL cs.AI cs.LG

Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, Tianyi Zhou

Instruction tuning is critical to large language models (LLMs) for achieving better instruction following and task adaptation capabilities but its success heavily relies on the training data quality. Many recent methods focus on improving the data quality but often overlook the compatibility of the data with the student model being finetuned. This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM's reflection and introspection for improving existing data quality with the data selection capability of the student LLM, to automatically refine existing instruction-tuning data. This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning and LLMs of superior performance. Selective Reflection-Tuning is a data augmentation and synthesis that generally improves LLM finetuning and self-improvement without collecting brand-new data. We apply our method to Alpaca and WizardLM data and achieve much stronger and top-tier 7B and 13B LLMs.

6/11/2024

cs.CL cs.AI cs.LG