An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates

Read original: arXiv:2405.18069 - Published 5/29/2024 by Albin Soutif--Cormerais, Simone Magistri, Joost van de Weijer, Andew D. Bagdanov

An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates

Overview

This paper analyzes how pre-trained AI models forget information when updated incrementally using a low-rank approach, a common technique for fine-tuning large language models.
The researchers examine the forgetting behavior of these models and propose strategies to mitigate it, which could lead to more stable and reliable AI systems.

Plain English Explanation

AI models like those used for language processing are often pre-trained on large datasets, giving them a broad base of knowledge. When these models are then fine-tuned on a specific task, such as answering questions or summarizing text, the standard approach is to make small updates to the model's parameters, or internal workings, rather than retraining the entire model from scratch.

This process, known as low-rank adaptation, can be efficient, but it also raises concerns about the model forgetting or losing the knowledge it had before the fine-tuning. The researchers in this paper set out to empirically study this "forgetting" phenomenon, exploring how it varies across different types of pre-trained models and fine-tuning approaches.

Their findings suggest that pre-trained models do indeed tend to forget previously learned information when updated incrementally, but that the degree of forgetting can be mitigated through careful design of the fine-tuning process. For example, the paper discusses techniques like batched low-rank adaptation and fairness-aware low-rank adaptation that can help preserve a model's original knowledge while still allowing for efficient fine-tuning.

By better understanding how forgetting occurs in these models, the researchers hope to develop more interference-free and adaptive fine-tuning techniques, leading to AI systems that are more stable, reliable, and capable of continual learning.

Technical Explanation

The paper empirically analyzes the forgetting behavior of pre-trained models when updated incrementally using low-rank adaptation techniques. The researchers evaluate various pre-trained models, including BERT, RoBERTa, and GPT-2, and examine how their performance on a range of tasks changes as they are fine-tuned on new datasets.

The key insights from their experiments include:

Forgetting is a significant issue: The pre-trained models tend to forget previously learned information, with their performance on initial tasks declining as they are fine-tuned on new tasks.
Forgetting varies across models and tasks: The degree of forgetting observed depends on factors like the model architecture, the pre-training dataset, and the specific tasks involved.
Strategies can mitigate forgetting: Techniques like batched low-rank adaptation and fairness-aware low-rank adaptation can help preserve a model's original knowledge during fine-tuning, reducing the extent of forgetting.

The researchers also discuss the potential for interference-free and adaptive fine-tuning approaches to further address the forgetting problem and enable more stable and capable AI systems.

Critical Analysis

The paper provides a thorough empirical analysis of forgetting in pre-trained models, highlighting an important challenge in the field of continual learning. The researchers acknowledge that their study is limited to a specific set of models and tasks, and that further investigation is needed to fully understand the factors that influence forgetting behavior.

One potential criticism is that the paper does not delve deeper into the underlying mechanisms driving forgetting in these models. A more detailed exploration of the neural dynamics and architectural properties that contribute to forgetting could help inform the development of more principled solutions.

Additionally, while the proposed strategies for mitigating forgetting, such as batched low-rank adaptation and fairness-aware low-rank adaptation, show promise, their effectiveness may be task-dependent. Further research is needed to understand the general applicability and limitations of these approaches.

Overall, the paper makes an important contribution to the understanding of forgetting in pre-trained models and highlights the need for continued innovation in fine-tuning techniques to enable more robust and reliable AI systems.

Conclusion

This paper presents an empirical analysis of the forgetting behavior observed in pre-trained AI models when updated incrementally using low-rank adaptation techniques. The researchers demonstrate that forgetting is a significant issue, with the degree of forgetting varying across different models and tasks.

The insights from this work can inform the development of more sophisticated fine-tuning strategies, such as interference-free and adaptive approaches, that can help preserve a model's original knowledge while still allowing for efficient updates. By addressing the forgetting problem, researchers can create AI systems that are more stable, reliable, and capable of continual learning, with broader implications for the field of artificial intelligence and its real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates

Albin Soutif--Cormerais, Simone Magistri, Joost van de Weijer, Andew D. Bagdanov

Broad, open source availability of large pretrained foundation models on the internet through platforms such as HuggingFace has taken the world of practical deep learning by storm. A classical pipeline for neural network training now typically consists of finetuning these pretrained network on a small target dataset instead of training from scratch. In the case of large models this can be done even on modest hardware using a low rank training technique known as Low-Rank Adaptation (LoRA). While Low Rank training has already been studied in the continual learning setting, existing works often consider storing the learned adapter along with the existing model but rarely attempt to modify the weights of the pretrained model by merging the LoRA with the existing weights after finishing the training of each task. In this article we investigate this setting and study the impact of LoRA rank on the forgetting of the pretraining foundation task and on the plasticity and forgetting of subsequent ones. We observe that this rank has an important impact on forgetting of both the pretraining and downstream tasks. We also observe that vision transformers finetuned in that way exhibit a sort of ``contextual'' forgetting, a behaviour that we do not observe for residual networks and that we believe has not been observed yet in previous continual learning works.

5/29/2024

Training Neural Networks from Scratch with Parallel Low-Rank Adapters

Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal

The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, its application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training.

7/30/2024

123

LoRA Learns Less and Forgets Less

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham

Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.

9/24/2024

Continual Forgetting for Pre-trained Vision Models

Hongbo Zhao, Bolin Ni, Haochen Wang, Junsong Fan, Fei Zhu, Yuxi Wang, Yuntao Chen, Gaofeng Meng, Zhaoxiang Zhang

For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Codes will be released on url{https://github.com/bjzhb666/GS-LoRA}.

7/19/2024