In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Read original: arXiv:2408.03560 - Published 8/9/2024 by Ayrton San Joaquin, Bin Wang, Zhengyuan Liu, Nicholas Asher, Brian Lim, Philippe Muller, Nancy Chen

Overview

This paper introduces a novel technique called In2Core for selecting a "coreset" of the most influential training examples during fine-tuning of large language models (LLMs) on instruction-following tasks.
The key idea is to leverage influence functions to identify the training examples that have the largest impact on the model's performance, and use this to construct a smaller, more efficient coreset for fine-tuning.
According to the paper's abstract, In2Core reaches performance similar to full fine-tuning with about 50% of the training data, which makes fine-tuning faster and more data-efficient.

Plain English Explanation

When training large language models like GPT-3 or BERT on specific tasks, a common approach is "fine-tuning" - taking the pre-trained model and further training it on a smaller, task-specific dataset. This allows the model to specialize and perform better on the target task.

However, fine-tuning can be computationally expensive and time-consuming, especially when the task-specific dataset is large. The key insight of this paper is that not all training examples are equally important - some have a much bigger impact on the final model performance than others.

The researchers developed a technique called In2Core that uses "influence functions" to identify the most influential training examples. Influence functions measure how much each training example contributes to the final model. By selecting a small "coreset" of the top influential examples, the researchers showed they could achieve comparable performance to full fine-tuning, but with significantly fewer training examples.
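For readers who want the formula behind that description, the classical influence function (as popularized by Koh and Liang, 2017) is shown below. This is the standard textbook form, given here as background; the summary does not state which exact variant or approximation In2Core uses. The symbols are assumptions of this sketch: z is a training example, z_test an evaluation example, L the loss, and theta-hat the trained parameters.

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
     \, H_{\hat\theta}^{-1} \,
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

Under this sign convention, a negative value means that upweighting z is expected to lower the loss on z_test, and a positive value means it is expected to raise it.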

This makes the fine-tuning process much faster and more data-efficient, which is especially important for tasks where labeled data is scarce or expensive to obtain. The paper demonstrates the effectiveness of In2Core on several instruction-following language tasks.

Technical Explanation

The core innovation of this paper is the In2Core method for coreset selection during fine-tuning of large language models on instruction-following tasks. The key steps are:

1. Compute Influence Functions: The researchers use influence functions to estimate how much each training example affects the trained model. Concretely, an influence score estimates how the model's loss on evaluation samples would change if a given training example were upweighted or removed, using the model's gradients.
2. Select a Coreset: Based on the influence scores, In2Core ranks the training examples and keeps a smaller "coreset" of the most influential ones.
3. Fine-tune on the Coreset: The researchers fine-tune the pre-trained model on the selected coreset instead of the full training set. They report performance comparable to full fine-tuning with substantially fewer training examples.
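The three steps above can be sketched on a toy model. The example below is a minimal illustration, not the authors' implementation: it swaps the LLM for a small L2-regularized logistic regression so that the Hessian can be formed exactly, and it assumes that "most influential" means largest absolute influence score. All function names, the synthetic data, and the 50% fraction are choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_grads(w, X, y):
    """Gradient of the logistic loss for each example, shape (n, d)."""
    return (sigmoid(X @ w) - y)[:, None] * X

def fit_logreg(X, y, l2=1e-2, steps=500, lr=0.5):
    """Plain gradient descent on the L2-regularized mean logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (per_example_grads(w, X, y).mean(axis=0) + l2 * w)
    return w

def influence_scores(w, X_train, y_train, X_eval, y_eval, l2=1e-2):
    """Step 1: score_i = -g_eval^T H^{-1} g_i for every training example."""
    p = sigmoid(X_train @ w)
    # Exact Hessian of the regularized training objective (feasible only for tiny models).
    H = (X_train * (p * (1 - p))[:, None]).T @ X_train / len(X_train)
    H += l2 * np.eye(len(w))
    g_eval = per_example_grads(w, X_eval, y_eval).mean(axis=0)
    g_train = per_example_grads(w, X_train, y_train)
    return -g_train @ np.linalg.solve(H, g_eval)

def select_coreset(scores, fraction=0.5):
    """Step 2: keep the examples with the largest |score| (an assumption of this sketch)."""
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-np.abs(scores))[:k]

# Synthetic data standing in for training and evaluation sets.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train = rng.normal(size=(200, 5))
y_train = (rng.uniform(size=200) < sigmoid(X_train @ w_true)).astype(float)
X_eval = rng.normal(size=(50, 5))
y_eval = (rng.uniform(size=50) < sigmoid(X_eval @ w_true)).astype(float)

w_full = fit_logreg(X_train, y_train)
scores = influence_scores(w_full, X_train, y_train, X_eval, y_eval)
coreset_idx = select_coreset(scores, fraction=0.5)

# Step 3: retrain using only the selected coreset.
w_core = fit_logreg(X_train[coreset_idx], y_train[coreset_idx])
```

For an LLM the exact Hessian is out of reach, so practical methods rely on approximations; ranking by signed score instead of magnitude is an equally plausible selection rule, and the summary does not say which one the paper uses.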

The paper evaluates In2Core on several instruction-following tasks, including the HumanEval and Anthropic Corpus datasets. The results demonstrate the data efficiency of the coreset approach compared to full fine-tuning.

Critical Analysis

The key strength of this work is the clever use of influence functions to identify the most important training examples for fine-tuning. This allows the models to be trained much more efficiently, which is crucial for scaling up to large language models and real-world tasks.

One potential limitation is that influence function computations can be expensive, especially for very large language models. The abstract describes an optimization that computes influence functions with a reduced number of layers while achieving similar accuracy, but this summary does not quantify the remaining computational overhead of that step.
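The paper's abstract mentions computing influence functions with a reduced number of layers to cut this cost. The sketch below shows one way such a restriction could look; it is not the authors' procedure. It uses a plain gradient dot product as a Hessian-free influence proxy, and the layer names, shapes, and random gradients are hypothetical placeholders.

```python
import numpy as np

def flatten_layers(grads, layers):
    """Concatenate the gradients of the chosen layers into one vector."""
    return np.concatenate([grads[name].ravel() for name in layers])

def layer_restricted_scores(train_grads, eval_grad, layers):
    """Dot-product influence proxy computed over the given layers only."""
    g_eval = flatten_layers(eval_grad, layers)
    return np.array([flatten_layers(g, layers) @ g_eval for g in train_grads])

# Hypothetical per-layer gradient shapes for a tiny model.
shapes = {"layer_0": (8, 8), "layer_1": (8, 8), "layer_2": (8, 4)}
rng = np.random.default_rng(0)

def random_grads():
    """Placeholder gradients; a real pipeline would backpropagate through the model."""
    return {name: rng.normal(size=shape) for name, shape in shapes.items()}

train_grads = [random_grads() for _ in range(20)]
eval_grad = random_grads()

# Scores over all layers versus scores over the last layer only.
full = layer_restricted_scores(train_grads, eval_grad, list(shapes))
reduced = layer_restricted_scores(train_grads, eval_grad, ["layer_2"])

full_dim = int(sum(np.prod(s) for s in shapes.values()))
reduced_dim = int(np.prod(shapes["layer_2"]))
```

Here the reduced score works with 32 gradient entries per example instead of 160. Because the gradients are random, agreement between `full` and `reduced` is not meaningful in this toy setup; with real gradients one would compare the two rankings (for example by rank correlation) before trusting the cheaper score.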

Additionally, the paper focuses on instruction-following tasks, but it's unclear how well In2Core would generalize to other fine-tuning scenarios. Further research is needed to understand the broader applicability of this technique.

Finally, the paper does not explore the robustness or generalization of the selected coresets. It's possible that models fine-tuned on a small coreset could be more brittle or have reduced out-of-distribution performance compared to full fine-tuning.

Conclusion

This paper presents a promising new technique called In2Core that leverages influence functions to enable data-efficient fine-tuning of large language models on instruction-following tasks. By selecting a small coreset of the most influential training examples, In2Core can achieve comparable performance to full fine-tuning, but with significantly fewer training examples.

This has important implications for scaling up language models to real-world applications, where labeled data can be scarce or expensive to obtain. The paper demonstrates the effectiveness of In2Core on several benchmark datasets, but further research is needed to explore its broader applicability and potential limitations.

Overall, this work represents an exciting advance in the field of efficient fine-tuning of large language models, and could pave the way for more data-efficient and scalable approaches to specializing these powerful models for specific tasks and domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Ayrton San Joaquin, Bin Wang, Zhengyuan Liu, Nicholas Asher, Brian Lim, Philippe Muller, Nancy Chen

Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Accessibility to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model's internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization to compute influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meantime, using influence functions to analyze model coverage to certain testing samples could provide a reliable and interpretable signal on the training set's coverage of those test points.

8/9/2024

Data-efficient Fine-tuning for LLM-based Recommendation

Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, Tat-Seng Chua

Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. To address this challenge, few-shot fine-tuning offers a promising approach to quickly adapt LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, aimed at identifying representative samples tailored for LLMs' few-shot fine-tuning. While coreset selection is closely related to the proposed task, existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data. To tackle these issues, we introduce two objectives for the data pruning task in the context of LLM-based recommendation: 1) high accuracy aims to identify the influential samples that can lead to high overall performance; and 2) high efficiency underlines the low costs of the data pruning process. To pursue the two objectives, we propose a novel data pruning method based on two scores, i.e., influence score and effort score, to efficiently identify the influential samples. Particularly, the influence score is introduced to accurately estimate the influence of sample removal on the overall performance. To achieve low costs of the data pruning process, we use a small-sized surrogate model to replace LLMs to obtain the influence score. Considering the potential gap between the surrogate model and LLMs, we further propose an effort score to prioritize some hard samples specifically for LLMs. Empirical results on three real-world datasets validate the effectiveness of our proposed method. In particular, the proposed method uses only 2% samples to surpass the full data fine-tuning, reducing time costs by 97%.

6/5/2024

Token-wise Influential Training Data Retrieval for Large Language Models

Huawei Lin, Jikai Long, Zhaozhuo Xu, Weijie Zhao

Given a Large Language Model (LLM) generation, how can we identify which training data led to this generation? In this paper, we proposed RapidIn, a scalable framework adapting to LLMs for estimating the influence of each training data. The proposed framework consists of two stages: caching and retrieval. First, we compress the gradient vectors by over 200,000x, allowing them to be cached on disk or in GPU/CPU memory. Then, given a generation, RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving over a 6,326x speedup. Moreover, RapidIn supports multi-GPU parallelization to substantially accelerate caching and retrieval. Our empirical result confirms the efficiency and effectiveness of RapidIn.

5/21/2024

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

Large language models (LLMs) are initially pretrained for broad capabilities and then finetuned with instruction-following datasets to improve their performance in interacting with humans. Despite advances in finetuning, a standardized guideline for selecting high-quality datasets to optimize this process remains elusive. In this paper, we first propose InstructMining, an innovative method designed for automatically selecting premium instruction-following data for finetuning LLMs. Specifically, InstructMining utilizes natural language indicators as a measure of data quality, applying them to evaluate unseen datasets. During experimentation, we discover that double descent phenomenon exists in large language model finetuning. Based on this observation, we further leverage BlendSearch to help find the best subset among the entire dataset (i.e., 2,532 out of 100,000). Experiment results show that InstructMining-7B achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.

7/30/2024