Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

Read original: arXiv:2406.16797 - Published 6/26/2024 by Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, Prateek Mittal

Overview

This paper proposes Lottery Ticket Adaptation (LoTA), a sparse adaptation method that mitigates destructive interference when large language models (LLMs) are fine-tuned on multiple tasks.
Destructive interference arises because existing adaptation methods modify all of the model's weights, so fine-tuning on a new task can degrade performance on earlier tasks (catastrophic forgetting).
The key idea of LoTA is to identify a sparse "lottery ticket" subnetwork for each task and to optimize only that subnetwork, so that the model maintains good performance on a task even after it is trained on others.

Plain English Explanation

The paper tackles "destructive interference", which can occur when a large language model (LLM) is fine-tuned on multiple tasks: training on a new task changes weights that earlier tasks relied on, so the model forgets those tasks or performs worse on them.
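Destructive interference can be illustrated with a toy example. The sketch below is plain Python and makes several assumptions that are not from the paper: the "model" is a list of four weights, each hypothetical task has a quadratic loss that pulls the weights toward its own target, and `dense_finetune` is a made-up helper that updates every weight.

```python
from typing import List


def loss(w: List[float], target: List[float]) -> float:
    """Squared error between the weights and a task's preferred weights."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def dense_finetune(w: List[float], target: List[float],
                   lr: float = 0.1, steps: int = 200) -> List[float]:
    """Plain gradient descent that updates every weight."""
    w = list(w)
    for _ in range(steps):
        # Gradient of (wi - ti)^2 with respect to wi is 2 * (wi - ti).
        w = [wi - lr * 2.0 * (wi - ti) for wi, ti in zip(w, target)]
    return w


# Hypothetical tasks that prefer different values for the same weights.
task_a = [1.0, 1.0, 0.0, 0.0]
task_b = [0.0, 0.0, 1.0, 1.0]

w = dense_finetune([0.0, 0.0, 0.0, 0.0], task_a)
loss_a_before = loss(w, task_a)   # task A is solved at this point

w = dense_finetune(w, task_b)     # training on B moves every weight
loss_a_after = loss(w, task_a)    # task A's loss has gone back up

print(loss_a_before, loss_a_after)
```

Because the second round of training is free to move all four weights, it undoes what the first round learned; that is the forgetting effect described above, in miniature.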

The researchers propose Lottery Ticket Adaptation (LoTA) to address this issue. The core idea is to find, for each task, a sparse "lottery ticket" subnetwork of the LLM and to fine-tune only the weights in that subnetwork. Because each task changes only a small part of the model, the intent is that training on a new task is less likely to overwrite what earlier tasks learned.

This approach is inspired by the lottery ticket hypothesis, which posits that neural networks contain sparse subnetworks that, trained on their own, can match the performance of the full network. The method applies this idea to adaptation by restricting fine-tuning to a sparse, task-specific subnetwork.
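One way to picture a sparse subnetwork is as a boolean mask over the weights. The sketch below is a minimal illustration in plain Python, not the authors' implementation. It assumes the mask is chosen by keeping the largest-magnitude entries of a hypothetical task vector (fine-tuned weights minus original weights); the function name `top_k_mask` and the parameter `keep_fraction` are invented for this example.

```python
from typing import List


def top_k_mask(delta: List[float], keep_fraction: float) -> List[bool]:
    """Return a boolean mask marking the largest-magnitude entries of `delta`.

    `delta` is a hypothetical task vector (fine-tuned minus original weights).
    `keep_fraction` is the share of entries that stay trainable.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, round(keep_fraction * len(delta)))
    # Indices sorted by descending magnitude; ties keep the earlier index.
    order = sorted(range(len(delta)), key=lambda i: -abs(delta[i]))
    keep = set(order[:k])
    return [i in keep for i in range(len(delta))]


delta = [0.02, -0.90, 0.05, 0.70, -0.01, 0.30]
mask = top_k_mask(delta, keep_fraction=0.5)
print(mask)
```

With `keep_fraction=0.5`, the three entries with the largest absolute change (indices 1, 3, and 5) are marked as the subnetwork, and the other three weights would stay frozen.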

Technical Explanation

The paper presents Lottery Ticket Adaptation (LoTA), a sparse adaptation method for fine-tuning large language models (LLMs) on multiple tasks without destructive interference between them.

As described in the paper's abstract, the method identifies a sparse subnetwork (the "lottery ticket") for the task at hand and then optimizes only the weights inside it, leaving the remaining weights unchanged. Because each task modifies only a small subset of the weights, later tasks are less likely to overwrite what earlier tasks learned. The authors also describe the extracted tickets as sparse task vectors, which enables merging models that were adapted to highly dissimilar tasks.
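Keeping each task's updates inside its own sparse subnetwork can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: the "model" is a list of four weights, each hypothetical task has a quadratic loss pulling the weights toward a target, and the masks are written by hand rather than found by any selection procedure. `masked_sgd` is an invented helper.

```python
from typing import List


def masked_sgd(weights: List[float], target: List[float], mask: List[bool],
               lr: float = 0.1, steps: int = 200) -> List[float]:
    """Minimise sum((w - target)^2), updating only the masked coordinates."""
    w = list(weights)
    for _ in range(steps):
        for i, trainable in enumerate(mask):
            if trainable:
                grad = 2.0 * (w[i] - target[i])  # derivative of (w - t)^2
                w[i] -= lr * grad
    return w


base = [0.0, 0.0, 0.0, 0.0]

# Hypothetical tasks with disjoint, hand-written masks (an assumption).
task_a_target, task_a_mask = [1.0, 1.0, 1.0, 1.0], [True, True, False, False]
task_b_target, task_b_mask = [-1.0, -1.0, -1.0, -1.0], [False, False, True, True]

after_a = masked_sgd(base, task_a_target, task_a_mask)
after_b = masked_sgd(after_a, task_b_target, task_b_mask)

# Task B never touched the coordinates that task A trained.
print(after_b[:2] == after_a[:2])
print([round(x, 3) for x in after_b])
```

In this toy setting the masks do not overlap, so the second task leaves the first task's weights exactly as they were. Real masks may overlap, in which case some interference can remain.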

The researchers evaluate the method on tasks including instruction following, reasoning, math, and summarization, comparing it to full fine-tuning and low-rank adaptation (LoRA). According to the abstract, the method obtains better performance than both baselines and maintains good performance even after training on other tasks, which indicates that it mitigates catastrophic forgetting.

Critical Analysis

The Lottery Ticket Adaptation (LoTA) paper presents a compelling approach to the challenge of destructive interference in large language model fine-tuning. By confining each task's updates to a sparse "lottery ticket" subnetwork, the method aims to let the model retain its capabilities on previously learned tasks while adapting to new ones.

One potential limitation of the approach is the computational overhead required to identify the lottery ticket subnetwork for each task. This extra step could limit the practical applicability of the method, especially for rapidly evolving real-world applications.

Additionally, the evaluation covers a fixed set of tasks (instruction following, reasoning, math, and summarization, according to the abstract). Further research would be needed to assess how well the method generalizes across a wider range of domains and task types.

It would also be valuable to explore the interpretability and explainability of the identified lottery ticket subnetworks. Understanding the specific features and representations that these subnetworks capture could provide valuable insights into the inner workings of large language models and how they adapt to different tasks.

Conclusion

The paper presents Lottery Ticket Adaptation (LoTA), a sparse adaptation method that addresses destructive interference when large language models are fine-tuned on multiple tasks. By identifying a sparse "lottery ticket" subnetwork for each task and optimizing only that subnetwork, the method aims to preserve performance on previously learned tasks while still adapting to new ones.

The reported results suggest that the method mitigates destructive interference and could be a valuable tool for improving the adaptability and robustness of large language models in real-world applications. However, further research is needed to address the computational overhead and explore the broader applicability of the method.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

Ashwinee Panda, Berivan Isik, Xiangyu Qi, Sanmi Koyejo, Tsachy Weissman, Prateek Mittal

Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse task vectors), LoTA also enables model merging over highly dissimilar tasks. Our code is made publicly available at https://github.com/kiddyboots216/lottery-ticket-adaptation.

6/26/2024

KS-Lottery: Finding Certified Lottery Tickets for Multilingual Language Models

Fei Yuan, Chang Ma, Shuai Yuan, Qiushi Sun, Lei Li

The lottery ticket hypothesis posits the existence of "winning tickets" within a randomly initialized neural network. Do winning tickets exist for LLMs in fine-tuning scenarios? How can we find such winning tickets? In this paper, we propose KS-Lottery, a method to identify a small subset of LLM parameters highly effective in multilingual fine-tuning. Our key idea is to use Kolmogorov-Smirnov Test to analyze the distribution shift of parameters before and after fine-tuning. We further theoretically prove that KS-Lottery can find the certified winning tickets in the embedding layer, fine-tuning on the found parameters is guaranteed to perform as well as full fine-tuning. Comparing KS-Lottery with other parameter-efficient tuning algorithms on translation tasks, the experimental results show that KS-Lottery finds a much smaller set of parameters for fine-tuning while achieving the comparable performance as full fine-tuning LLM. Surprisingly, we find that fine-tuning 18 tokens' embedding of LLaMA suffices to reach the fine-tuning translation performance (code: https://github.com/CONE-MT/KS-Lottery).

6/4/2024

LOTUS: Improving Transformer Efficiency with Sparsity Pruning and Data Lottery Tickets

Ojasw Upadhyay

Vision transformers have revolutionized computer vision, but their computational demands present challenges for training and deployment. This paper introduces LOTUS (LOttery Transformers with Ultra Sparsity), a novel method that leverages data lottery ticket selection and sparsity pruning to accelerate vision transformer training while maintaining accuracy. Our approach focuses on identifying and utilizing the most informative data subsets and eliminating redundant model parameters to optimize the training process. Through extensive experiments, we demonstrate the effectiveness of LOTUS in achieving rapid convergence and high accuracy with significantly reduced computational requirements. This work highlights the potential of combining data selection and sparsity techniques for efficient vision transformer training, opening doors for further research and development in this area.

5/3/2024

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Weiyu Huang, Yuezhou Hu, Guohao Jian, Jun Zhu, Jianfei Chen

The tremendous success of Large Language Models (LLMs) across various complex tasks relies heavily on their substantial scale, which raises challenges during model deployment due to their large memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often experience considerable performance degradation on complex language understanding tasks, calling into question the feasibility of pruning in LLMs. To address this issue, we propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST). Unlike previous one-shot pruning methods, AST incrementally transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process. Furthermore, we observe that using distillation with a dense model as the teacher can prevent the sparse model from falling into local optima and accelerate convergence. In addition, we incorporate extra well-initialized parameters to further enhance model performance with minimal increase in memory footprint. AST can significantly enhance model performance, approaching the level of dense models. When applied to the LLaMA2-7B model, AST reduces the zero-shot accuracy gap between dense and semi-structured sparse models to 1.12% across multiple zero-shot tasks, utilizing less than 0.4% of the pretraining tokens. Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models when combined with existing quantization techniques.

8/27/2024