Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model

2404.10306

Published 6/4/2024 by Hengyuan Zhang, Yanru Wu, Dawei Li, Sak Yang, Rui Zhao, Yong Jiang, Fei Tan

Abstract

Aligned Large Language Models (LLMs) showcase remarkable versatility, capable of handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected to exhibit speciality, excelling in specific applications. However, fine-tuning with extra data, a common practice to gain speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks. In response to this challenge, we propose CoFiTune, a coarse to fine framework in an attempt to strike the balance between speciality and versatility. At the coarse-grained level, an empirical tree-search algorithm is utilized to pinpoint and update specific modules that are crucial for speciality, while keeping other parameters frozen; at the fine-grained level, a soft-masking mechanism regulates the update to the LLMs, mitigating the CF issue without harming speciality. In an overall evaluation of both speciality and versatility, CoFiTune consistently outperforms baseline methods across diverse tasks and model scales. Compared to the full-parameter SFT, CoFiTune leads to about 14% versatility improvement and marginal speciality loss on a 13B model. Lastly, based on further analysis, we provide a speculative insight into the information forwarding process in LLMs, which helps explain the effectiveness of the proposed method. The code is available at https://github.com/rattlesnakey/CoFiTune.

Overview

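This paper presents CoFiTune, a "coarse-to-fine" framework for supervised fine-tuning (SFT) of large language models (LLMs) that aims to balance speciality on a target task with general versatility.
The framework works at two levels: at the coarse-grained level, an empirical tree search pinpoints the modules that are crucial for speciality and updates only those, keeping all other parameters frozen; at the fine-grained level, a soft-masking mechanism regulates the update to limit catastrophic forgetting.
The authors report that CoFiTune consistently outperforms baseline methods in an overall evaluation of speciality and versatility; compared with full-parameter SFT on a 13B model, it improves versatility by about 14% with a marginal loss in speciality.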

Plain English Explanation

The paper discusses a new way to fine-tune large language models so that they gain a speciality without losing their versatility. Aligned LLMs can handle many kinds of tasks, but fine-tuning them on extra data for one application often causes catastrophic forgetting: the model gets better at the new task and worse at things it could already do.

The authors' solution, CoFiTune, works at two levels. At the coarse level, it searches for the parts of the model that matter most for the new speciality and trains only those, leaving every other parameter frozen. At the fine level, it applies a soft mask that regulates the updates, so that training does less damage to what the model already knows.

The key result is that this "coarse-to-fine" framework outperforms baseline methods when speciality and versatility are evaluated together. Compared with full-parameter fine-tuning on a 13B model, the authors report about 14% better versatility at the cost of a marginal loss in speciality. This could be useful for applications that need a model to be strong in one domain while remaining generally capable.

Technical Explanation

The paper proposes CoFiTune, a "coarse-to-fine" framework for supervised fine-tuning of large language models. The approach operates at two levels of granularity:

Coarse-grained level: An empirical tree-search algorithm pinpoints the specific modules that are crucial for speciality. Only these modules are updated; all other parameters stay frozen.
Fine-grained level: A soft-masking mechanism regulates the update applied to the LLM, mitigating catastrophic forgetting without harming speciality.
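
The combination of frozen modules and soft-masked updates described in the abstract can be pictured as a single update rule. The following is a minimal sketch, not the authors' implementation: the abstract does not say how the soft mask is computed, so the sketch assumes a hypothetical per-weight importance score in [0, 1] and uses 1 - importance as the mask. The function name masked_sgd_step, the module names, and the use of plain SGD are likewise illustrative assumptions.

```python
from typing import Dict, List, Set

Params = Dict[str, List[float]]


def masked_sgd_step(
    params: Params,
    grads: Params,
    trainable: Set[str],
    importance: Params,
    lr: float = 0.1,
) -> Params:
    """Return updated parameters after one soft-masked SGD step.

    Modules not listed in `trainable` are frozen (coarse-grained level).
    Inside trainable modules, each gradient is scaled by a soft mask
    (fine-grained level). ASSUMPTION: mask = 1 - importance, where
    `importance` is a hypothetical score in [0, 1] for how much a weight
    matters to the model's existing versatility.
    """
    new_params: Params = {}
    for name, values in params.items():
        if name not in trainable:
            # Frozen module: copy the weights unchanged.
            new_params[name] = list(values)
            continue
        updated = []
        for w, g, imp in zip(values, grads[name], importance[name]):
            mask = 1.0 - imp  # important weights receive smaller updates
            updated.append(w - lr * mask * g)
        new_params[name] = updated
    return new_params


# Toy example with two hypothetical modules; only the second is trainable.
params = {"layer0.ffn": [1.0, 1.0], "layer1.ffn": [1.0, 1.0]}
grads = {"layer0.ffn": [1.0, 1.0], "layer1.ffn": [1.0, 1.0]}
importance = {"layer1.ffn": [0.0, 1.0]}

stepped = masked_sgd_step(params, grads, {"layer1.ffn"}, importance, lr=0.5)
print(stepped)
```

In this toy run the frozen module is untouched, the first weight of the trainable module takes a full step, and the second weight, marked fully important, does not move. A mask of all ones would reduce the rule to ordinary fine-tuning of the selected modules.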

In an overall evaluation of both speciality and versatility, the authors report that CoFiTune consistently outperforms baseline methods across diverse tasks and model scales. Compared with full-parameter SFT on a 13B model, it yields about a 14% improvement in versatility with a marginal loss in speciality.

The key technical contributions include:

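A formulation of the two-level CoFiTune framework: module selection by an empirical tree search, plus soft-masked updates
An empirical analysis of the trade-off between speciality and versatility across tasks and model scales
A speculative insight into the information forwarding process in LLMs, which helps explain why the method works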
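
The abstract describes the coarse-grained module search only as an "empirical tree-search algorithm". One way to picture such a search, as a sketch under assumptions rather than the paper's procedure, is to repeatedly halve a range of layers and descend into the better-scoring half. The names tree_search and toy_score and the halving scheme are hypothetical. In a real setting the score would come from fine-tuning only that range and measuring speciality and versatility, which is far more expensive than the toy function used here.

```python
from typing import Callable, List, Tuple

LayerRange = Tuple[int, int]  # half-open interval [start, end)


def split(r: LayerRange) -> List[LayerRange]:
    """Split a layer range into two halves (the children in the tree)."""
    start, end = r
    mid = (start + end) // 2
    return [(start, mid), (mid, end)]


def tree_search(
    num_layers: int,
    score: Callable[[LayerRange], float],
    min_width: int = 1,
) -> Tuple[LayerRange, float]:
    """Greedy top-down search for the best-scoring layer range.

    ASSUMPTION: the tree is built by halving layer ranges, and the search
    stops as soon as neither child improves on the best score so far.
    """
    current = (0, num_layers)
    best_range, best_score = current, score(current)
    while current[1] - current[0] > min_width:
        child = max(split(current), key=score)
        child_score = score(child)
        if child_score <= best_score:
            break  # no child beats the current best: stop descending
        best_range, best_score = child, child_score
        current = child
    return best_range, best_score


def toy_score(r: LayerRange) -> float:
    """Stand-in score that favors narrow ranges centered on layer 12."""
    start, end = r
    center = (start + end) / 2
    width = end - start
    return -abs(center - 12) - 0.1 * width


best_range, best_score = tree_search(32, toy_score)
print(best_range, best_score)
```

The greedy descent keeps the number of expensive evaluations small (two per tree level), at the cost of possibly missing a good range that lies in a branch it discarded early.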

Critical Analysis

The paper presents a well-designed and thorough exploration of the coarse-to-fine fine-tuning approach. However, some potential limitations and areas for further research are worth noting:

The module search is described as empirical, so the modules it selects may depend on the models and tasks studied. How well those selections transfer to other architectures or domains is an open question.
The experiments focus on supervised fine-tuning, but the framework could potentially be applied to other settings, such as reinforcement learning or unsupervised pretraining. Exploring these alternative settings could broaden the applicability of the approach.
The authors' explanation of why the method works, based on the information forwarding process in LLMs, is speculative by their own description. A more formal treatment of the tension between speciality and versatility could provide additional insight.

Overall, the paper makes a compelling case for the coarse-to-fine framework as an effective way to balance the competing goals of model specialization and generalization.

Conclusion

This paper presents CoFiTune, a "coarse-to-fine" framework for supervised fine-tuning of large language models. By updating only the modules that a tree search identifies as crucial for speciality, and by soft-masking the updates, the approach aims to produce models that gain speciality while retaining their versatility.

The key insight is that controlling both where the model is updated and how strongly lets it develop specialized expertise while limiting catastrophic forgetting, a fundamental challenge in language model development. The reported results, including about 14% better versatility than full-parameter SFT on a 13B model with marginal speciality loss, suggest the framework could be a valuable tool for a wide range of language-based applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Comparative Analysis of Different Efficient Fine Tuning Methods of Large Language Models (LLMs) in Low-Resource Setting

Krishna Prasad Varadarajan Srinivasan, Prasanth Gumpena, Madhusudhana Yattapu, Vishal H. Brahmbhatt

In the domain of large language models (LLMs), arXiv:2305.16938 showed that few-shot full-model fine-tuning, namely Vanilla Fine Tuning (FT) and Pattern-Based Fine Tuning (PBFT), and In-Context Learning (ICL) generalize similarly on Out-Of-Domain (OOD) datasets, but vary in terms of task adaptation. However, they both pose challenges, especially in terms of memory requirements. In this paper, we further try to push the understanding of different fine-tuning strategies for LLM and aim to bring a myriad of these on the same pedestal for an elaborate comparison with full-model fine-tuning on two diverse datasets. To that end, we conducted a series of experiments, beginning with state-of-the-art methods like vanilla fine-tuning and Pattern-Based Fine-Tuning (PBFT) on pre-trained models across two datasets, COLA and MNLI. We then investigate adaptive fine-tuning and the efficiency of LoRA adapters in a few-shot setting. Finally, we also compare an alternative approach that has gained recent popularity, context distillation, with the vanilla FT and PBFT with and without few-shot setup. Our findings suggest that these alternative strategies that we explored can exhibit out-of-domain generalization comparable to that of vanilla FT and PBFT. PBFT under-performs Vanilla FT on out-of-domain (OOD) data, emphasizing the need for effective prompts. Further, our adaptive fine-tuning and LoRA experiments perform comparably to or slightly worse than the standard fine-tunings as anticipated, since standard fine-tunings involve tuning the entire model. Finally, our context distillation experiments out-perform the standard fine-tuning methods. These findings underscore that eventually the choice of an appropriate fine-tuning method depends on the available resources (memory, compute, data) and task adaptability.

5/24/2024

cs.CL cs.LG

Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models

Minghao Wu, Thuy-Trang Vu, Lizhen Qu, Gholamreza Haffari

Large language models (LLMs) are typically fine-tuned on diverse and extensive datasets sourced from various origins to develop a comprehensive range of skills, such as writing, reasoning, chatting, coding, and more. Each skill has unique characteristics, and these datasets are often heterogeneous and imbalanced, making the fine-tuning process highly challenging. Balancing the development of each skill while ensuring the model maintains its overall performance requires sophisticated techniques and careful dataset curation. In this work, we propose a general, model-agnostic, reinforcement learning framework, Mixture-of-Skills (MoS), that learns to optimize data usage automatically during the fine-tuning process. This framework ensures the optimal comprehensive skill development of LLMs by dynamically adjusting the focus on different datasets based on their current learning state. To validate the effectiveness of MoS, we conduct extensive experiments using three diverse LLM backbones on two widely used benchmarks and demonstrate that MoS substantially enhances model performance. Building on the success of MoS, we propose MoSpec, an adaptation for task-specific fine-tuning, which harnesses the utilities of various datasets for a specific purpose. Our work underlines the significance of dataset rebalancing and presents MoS as a powerful, general solution for optimizing data usage in the fine-tuning of LLMs for various purposes.

6/14/2024

cs.CL

Data-efficient Fine-tuning for LLM-based Recommendation

Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, Tat-Seng Chua

Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. To address this challenge, few-shot fine-tuning offers a promising approach to quickly adapt LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, aimed at identifying representative samples tailored for LLMs' few-shot fine-tuning. While coreset selection is closely related to the proposed task, existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data. To tackle these issues, we introduce two objectives for the data pruning task in the context of LLM-based recommendation: 1) high accuracy aims to identify the influential samples that can lead to high overall performance; and 2) high efficiency underlines the low costs of the data pruning process. To pursue the two objectives, we propose a novel data pruning method based on two scores, i.e., influence score and effort score, to efficiently identify the influential samples. Particularly, the influence score is introduced to accurately estimate the influence of sample removal on the overall performance. To achieve low costs of the data pruning process, we use a small-sized surrogate model to replace LLMs to obtain the influence score. Considering the potential gap between the surrogate model and LLMs, we further propose an effort score to prioritize some hard samples specifically for LLMs. Empirical results on three real-world datasets validate the effectiveness of our proposed method. In particular, the proposed method uses only 2% of the samples to surpass the full data fine-tuning, reducing time costs by 97%.

6/5/2024

cs.IR

LoFiT: Localized Fine-tuning on LLM Representations

Fangcong Yin, Xi Ye, Greg Durrett

Recent work in interpretability shows that large language models (LLMs) can be adapted for new tasks in a learning-free way: it is possible to intervene on LLM representations to elicit desired behaviors for alignment. For instance, adding certain bias vectors to the outputs of certain attention heads is reported to boost the truthfulness of models. In this work, we show that localized fine-tuning serves as an effective alternative to such representation intervention methods. We introduce a framework called Localized Fine-Tuning on LLM Representations (LoFiT), which identifies a subset of attention heads that are most important for learning a specific task, then trains offset vectors to add to the model's hidden representations at those selected heads. LoFiT localizes to a sparse set of heads (3%) and learns the offset vectors from limited training data, comparable to the settings used for representation intervention. For truthfulness and reasoning tasks, we find that LoFiT's intervention vectors are more effective for LLM adaptation than vectors from representation intervention methods such as Inference-time Intervention. We also find that the localization step is important: selecting a task-specific set of attention heads can lead to higher performance than intervening on heads selected for a different task. Finally, for the tasks we study, LoFiT achieves comparable performance to other parameter-efficient fine-tuning methods such as LoRA, despite modifying 20x-200x fewer parameters than these methods.

6/4/2024

cs.CL