Fine-tuning can cripple your foundation model; preserving features may be the solution

2308.13320

Published 7/2/2024 by Jishnu Mukhoti, Yarin Gal, Philip H. S. Torr, Puneet K. Dokania

Abstract

Pre-trained foundation models, due to their enormous capacity and exposure to vast amounts of data during pre-training, are known to have learned plenty of real-world concepts. An important step in making these pre-trained models effective on downstream tasks is to fine-tune them on related datasets. While various fine-tuning methods have been devised and have been shown to be highly effective, we observe that a fine-tuned model's ability to recognize concepts on tasks $\textit{different}$ from the downstream one is reduced significantly compared to its pre-trained counterpart. This is an undesirable effect of fine-tuning as a substantial amount of resources was used to learn these pre-trained concepts in the first place. We call this phenomenon "concept forgetting" and via experiments show that most end-to-end fine-tuning approaches suffer heavily from this side effect. To this end, we propose a simple fix to this problem by designing a new fine-tuning method called $\textit{LDIFS}$ (short for $\ell_2$ distance in feature space) that, while learning new concepts related to the downstream task, allows a model to preserve its pre-trained knowledge as well. Through extensive experiments on 10 fine-tuning tasks we show that $\textit{LDIFS}$ significantly reduces concept forgetting. Additionally, we show that $\textit{LDIFS}$ is highly effective in performing continual fine-tuning on a sequence of tasks as well, in comparison with both fine-tuning as well as continual learning baselines.

Overview

Pre-trained foundation models have learned a wealth of real-world concepts from their exposure to vast amounts of data during pre-training.
Fine-tuning these models on related datasets is an important step to make them effective on downstream tasks.
However, fine-tuning can lead to a significant reduction in the model's ability to recognize concepts unrelated to the downstream task, a phenomenon called "concept forgetting."
The paper proposes a new fine-tuning method called LDIFS that aims to preserve the model's pre-trained knowledge while learning new concepts for the downstream task.

Plain English Explanation

Large AI models, like those used for language processing or image recognition, are often pre-trained on a massive amount of data to learn general real-world concepts. This allows them to be highly capable at a wide variety of tasks. However, to perform well on a specific task, these models need to be further refined or "fine-tuned" on a smaller dataset related to that task.

The problem is that this fine-tuning process can cause the model to "forget" some of the general knowledge it had learned during the initial pre-training. It becomes focused on the new task at the expense of its broader understanding. This "concept forgetting" is undesirable, as a lot of effort went into building up that initial knowledge.

The researchers propose a new fine-tuning method called LDIFS that aims to address this issue. LDIFS allows the model to learn the new task-specific concepts while also preserving its pre-trained knowledge. Through experiments, the researchers show that LDIFS is effective at reducing concept forgetting compared to traditional fine-tuning approaches.

This is an important advancement, as it means we can fine-tune powerful AI models for specific applications without losing their broad understanding of the world. This could lead to more capable and versatile AI systems that can fluidly adapt to different tasks while still maintaining a deep knowledge base.

Technical Explanation

The paper explores the phenomenon of "concept forgetting" that occurs when pre-trained foundation models are fine-tuned on downstream tasks. The authors observe that while fine-tuning improves performance on the target task, it significantly reduces the model's ability to recognize concepts unrelated to that task, despite the substantial resources used to learn those pre-trained concepts.
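One simple way to make this observation concrete is to compare accuracy on tasks other than the fine-tuning task before and after fine-tuning. The sketch below assumes that reading; the summary does not state the paper's exact metric, and the function name, task names, and accuracy values are all hypothetical placeholders rather than reported results.

```python
def concept_forgetting(pretrained_acc, finetuned_acc):
    """Per-task accuracy change after fine-tuning (negative means forgetting).

    Both arguments map a task name to accuracy (in %) on a task that is
    different from the one the model was fine-tuned on.
    """
    return {
        task: finetuned_acc[task] - pretrained_acc[task]
        for task in pretrained_acc
    }


# Hypothetical accuracies, for illustration only (not from the paper).
pretrained = {"task_a": 80.0, "task_b": 65.0}
finetuned = {"task_a": 70.0, "task_b": 60.0}

deltas = concept_forgetting(pretrained, finetuned)
mean_delta = sum(deltas.values()) / len(deltas)

print(deltas)      # per-task change in accuracy
print(mean_delta)  # average change across the other tasks
```

A clearly negative average would indicate that the fine-tuned model has lost some ability on tasks it was not fine-tuned for.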

To address this issue, the researchers propose a new fine-tuning method called LDIFS (short for "L2 distance in feature space"). LDIFS aims to preserve the model's pre-trained knowledge while enabling it to learn new concepts related to the downstream task.

Through extensive experiments on 10 fine-tuning tasks, the authors demonstrate that LDIFS significantly reduces concept forgetting compared to standard end-to-end fine-tuning approaches. They also show that LDIFS is highly effective in performing continual fine-tuning on a sequence of tasks, outperforming both fine-tuning and continual learning baselines.

The key insight behind LDIFS is to constrain the fine-tuning process to maintain the model's feature representations, which encode its pre-trained knowledge, while still allowing the model to learn new task-specific concepts. This is achieved by adding a regularization term to the fine-tuning objective that encourages the model to minimize the L2 distance between its pre-trained and fine-tuned feature representations.
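The regularized objective described above can be sketched in plain Python as a task loss plus a weighted feature-distance penalty. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say which layers' features are compared, whether the distance is squared, or how the penalty is weighted, so the mean squared distance, the weight `lam`, and all function names here are assumptions.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one example, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def feature_distance(feats_pre, feats_ft):
    """Mean squared L2 distance between two batches of feature vectors.

    feats_pre: features from the frozen pre-trained model.
    feats_ft:  features from the model being fine-tuned, same inputs.
    """
    total = 0.0
    for f0, f1 in zip(feats_pre, feats_ft):
        total += sum((a - b) ** 2 for a, b in zip(f0, f1))
    return total / len(feats_pre)


def regularized_loss(logits_batch, labels, feats_pre, feats_ft, lam=1.0):
    """Task loss plus lam times the feature-space penalty (lam is assumed)."""
    task = sum(
        cross_entropy(logits, y) for logits, y in zip(logits_batch, labels)
    ) / len(labels)
    return task + lam * feature_distance(feats_pre, feats_ft)


# Toy batch of two examples with 2-dimensional features.
feats_pre = [[1.0, 0.0], [0.0, 1.0]]
feats_ft = [[1.0, 0.0], [0.0, 3.0]]   # second example has drifted
logits_batch = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]

penalty = feature_distance(feats_pre, feats_ft)
loss = regularized_loss(logits_batch, labels, feats_pre, feats_ft, lam=0.5)
print(penalty, loss)
```

The penalty is zero when the fine-tuned features match the pre-trained ones and grows as they drift apart, so a larger `lam` trades downstream fit for closer preservation of the pre-trained features.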

The authors also explore the relationship between fine-tuning and supervised pre-training, and how different layer-wise fine-tuning strategies can impact the preservation of pre-trained knowledge.

Critical Analysis

The paper presents a compelling solution to the concept forgetting problem, which is an important challenge in the effective use of pre-trained foundation models. The proposed LDIFS method seems to be a straightforward yet effective way to maintain the model's broad knowledge while still allowing it to learn new task-specific concepts.

One potential limitation of the research is the focus on a relatively narrow set of fine-tuning tasks. While the authors demonstrate the effectiveness of LDIFS across 10 tasks, it would be valuable to see how the method performs on an even broader range of applications, including more diverse and challenging tasks.

Additionally, the paper does not delve deeply into the underlying reasons why concept forgetting occurs in the first place. A more thorough investigation of the mechanisms behind this phenomenon could lead to further insights and potentially even more effective solutions.

Finally, the authors evaluate LDIFS in continual fine-tuning on a sequence of tasks and report that it outperforms both fine-tuning and continual learning baselines. How well this holds over longer or more varied task sequences would be an interesting area for future research.

Overall, the paper presents a practical and promising solution to an important problem in the field of foundation model fine-tuning. The LDIFS method could have significant implications for the development of more capable and versatile AI systems.

Conclusion

This paper addresses the concept forgetting problem that arises when pre-trained foundation models are fine-tuned on downstream tasks. The researchers propose a new fine-tuning method called LDIFS that effectively preserves the model's pre-trained knowledge while enabling it to learn new task-specific concepts.

The key contribution of this work is demonstrating that it is possible to fine-tune powerful AI models without sacrificing their broad understanding of the world. This could lead to the development of more capable and adaptable AI systems that can fluidly handle a wide variety of tasks while maintaining a deep knowledge base.

These findings could change how pre-trained foundation models are fine-tuned and deployed, opening up new applications of AI across diverse domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
