LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views

2402.04644

Published 6/21/2024 by Yuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi and 1 other

cs.LG cs.AI

Abstract

Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While there are many successes of fine-tuning on various tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitations of fine-tuning data and regulate fine-tuning to preserve the general representation learned from pre-training data. However, potential limitations in the pre-training data and models are often ignored. In this paper, we contend that overly relying on the pre-trained representation may hinder fine-tuning from learning essential representations for downstream tasks and thus hurt its OOD generalization. It can be especially catastrophic when new tasks are from different (sub)domains compared to pre-training data. To address the issues in both pre-training and fine-tuning data, we propose a novel generalizable fine-tuning method LEVI (Layer-wise Ensemble of different VIews), where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model, while preserving its efficiencies. By combining two complementing models, LEVI effectively suppresses problematic features in both the fine-tuning data and pre-trained model and preserves useful features for new tasks. Broad experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization via emphasizing different views from fine-tuning data and pre-trained features.

Overview

Finetuning pre-trained models is a widely used technique for leveraging the power of large foundation models in new tasks.
However, recent studies have observed challenges in the generalization of finetuned models to unseen distributions (out-of-distribution; OOD).
To address this issue, some previous work has focused on limitations in the finetuning data and preserving the general representation from pre-training.
This paper argues that over-reliance on the pre-trained representation can hinder finetuning from learning essential representations for downstream tasks, especially when the new tasks are from different domains compared to pre-training.

Plain English Explanation

Large pre-trained models like GPT-3 or DALL-E have shown impressive capabilities across a wide range of tasks. To use these powerful models for a specific new task, a common approach is finetuning: taking the pre-trained model and training it further on a smaller dataset for the target task.

However, recent research has found that finetuned models can struggle to generalize to new situations that are different from the data they were finetuned on. For example, a language model finetuned on news articles may perform poorly when applied to scientific papers or social media posts.
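The gap described above can be made concrete with a toy example. The sketch below is not from the paper; the datasets, the "shortcut" feature, and both models are made-up assumptions. It shows how a model that latches onto a feature that only happens to match the label in the finetuning data looks perfect in-distribution and fails once that correlation breaks.

```python
def accuracy(predict, dataset):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(1 for features, label in dataset if predict(features) == label)
    return correct / len(dataset)

# Each example is ((core_feature, shortcut_feature), label). All values are invented.
# In-distribution: the shortcut feature happens to agree with the label.
in_distribution = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]
# Out-of-distribution: the shortcut no longer agrees with the label.
out_of_distribution = [((1, 0), 1), ((0, 1), 0), ((1, 0), 1), ((0, 1), 0)]

def shortcut_model(features):
    """Hypothetical model that relies only on the shortcut feature."""
    return features[1]

def core_model(features):
    """Hypothetical model that relies only on the task-relevant feature."""
    return features[0]

id_acc = accuracy(shortcut_model, in_distribution)
ood_acc = accuracy(shortcut_model, out_of_distribution)
core_ood_acc = accuracy(core_model, out_of_distribution)
print(f"shortcut model: ID={id_acc:.2f}, OOD={ood_acc:.2f}")
print(f"core model:     OOD={core_ood_acc:.2f}")
```

In-distribution accuracy alone cannot tell the two models apart, which is why OOD evaluation uses data drawn from a shifted distribution.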

Some prior work has tried to address this by carefully curating the finetuning dataset or regularizing the finetuning process to preserve the general knowledge from pre-training. But this paper argues that the issue may also lie in limitations of the pre-trained model itself. If the pre-training data and model have inherent biases or blind spots, the finetuned model may pick up on these problematic patterns rather than learning the truly essential representations for the new task.

To overcome these challenges in both pre-training and finetuning, the paper introduces a new method called LEVI (Layer-wise Ensemble of different VIews). The key idea is to adaptively ensemble the pre-trained model with a small task-specific model, allowing the two complementary views to suppress unhelpful features and preserve useful ones for good out-of-distribution generalization.

Technical Explanation

This paper proposes a novel finetuning method called LEVI (Layer-wise Ensemble of different VIews) to improve the out-of-distribution (OOD) generalization of finetuned models.

The authors argue that over-relying on the pre-trained representation can hinder finetuning from learning essential representations for downstream tasks, especially when the new tasks are from different (sub)domains compared to the pre-training data. To address the limitations in both pre-training and finetuning data, LEVI adaptively ensembles the pre-trained model layer-wise with a small task-specific model.

By combining these two complementary models, LEVI effectively suppresses problematic features in both the finetuning data and the pre-trained model, while preserving useful features for the new task. Broad experiments with large language and vision models show that LEVI greatly improves finetuning generalization by emphasizing different views from the finetuning data and pre-trained features.
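To make the idea of a layer-wise ensemble more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to concatenate each intermediate pre-trained feature with the task-specific feature, the per-layer linear heads, and the softmax mixing weights are all hypothetical, and the random weights stand in for parameters that would normally be frozen (pre-trained layers) or learned (everything else).

```python
import math
import random

def linear(weights, bias, x):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, n_out, n_in):
    """Stand-in for trained weights: returns (weights, bias)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def layerwise_ensemble(x, pretrained_layers, task_layer, heads, mix_logits):
    """Hypothetical layer-wise ensemble forward pass (illustrative only).

    pretrained_layers: layers of the large model, applied in sequence (assumed frozen).
    task_layer: the small task-specific model, applied to the raw input.
    heads: one linear head per pre-trained layer, reading [h_l ; z].
    mix_logits: one scalar per layer; softmax turns them into ensemble weights.
    """
    z = relu(linear(*task_layer, x))      # task-specific view of the input
    h = x
    per_layer_logits = []
    for layer, head in zip(pretrained_layers, heads):
        h = relu(linear(*layer, h))       # intermediate pre-trained feature
        per_layer_logits.append(linear(*head, h + z))  # list concat of both views
    alphas = softmax(mix_logits)          # adaptive per-layer weights
    n_classes = len(per_layer_logits[0])
    combined = [sum(a * logits[c] for a, logits in zip(alphas, per_layer_logits))
                for c in range(n_classes)]
    return softmax(combined), alphas

rng = random.Random(0)
d_in, d_hidden, d_task, n_classes, n_layers = 4, 6, 3, 2, 3
pretrained_layers = [random_layer(rng, d_hidden, d_in)]
pretrained_layers += [random_layer(rng, d_hidden, d_hidden) for _ in range(n_layers - 1)]
task_layer = random_layer(rng, d_task, d_in)
heads = [random_layer(rng, n_classes, d_hidden + d_task) for _ in range(n_layers)]
mix_logits = [0.0] * n_layers  # would be learned; zeros give a uniform ensemble

probs, alphas = layerwise_ensemble([1.0, -0.5, 0.25, 2.0],
                                   pretrained_layers, task_layer, heads, mix_logits)
print(probs, alphas)
```

In this sketch, each head sees both an intermediate pre-trained feature and the task-specific feature, so training could in principle downweight whichever view is unhelpful at a given depth. How LEVI actually combines and trains the two models is specified in the paper, not here.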

Critical Analysis

The paper makes a compelling case that over-reliance on pre-trained representations can be a key limitation of finetuning, and that addressing issues in both the pre-training and finetuning data is crucial for improving OOD generalization.

The LEVI method proposed in the paper is an interesting approach to adaptively leverage the strengths of both the pre-trained model and a task-specific model. However, the authors acknowledge that LEVI introduces additional complexity and hyperparameters that need to be carefully tuned.

Another potential limitation is that the paper focuses mainly on evaluating LEVI on language and vision tasks. It would be valuable to explore its effectiveness in other domains, such as structured prediction or multi-modal tasks, where the mismatch between pre-training and finetuning data may manifest differently.

Additionally, the paper does not deeply explore the specific mechanisms by which LEVI suppresses problematic features and preserves useful ones. A more detailed analysis of the internal workings of the method could provide additional insights.

Overall, this paper makes an important contribution by highlighting the need to consider limitations in both pre-training and finetuning when designing effective finetuning approaches, and presents a promising step towards more robust and generalizable finetuning of large foundation models.

Conclusion

This paper presents a novel finetuning method called LEVI that aims to improve the out-of-distribution generalization of models by adaptively ensembling the pre-trained model with a task-specific model. By addressing limitations in both the pre-training and finetuning data, LEVI is able to suppress problematic features and preserve useful representations for the new task, leading to significant improvements in OOD performance across language and vision benchmarks.

The insights from this work highlight the importance of carefully considering the underlying characteristics of pre-trained models and finetuning data when leveraging large foundation models for specific applications. As the use of pre-trained models becomes increasingly widespread, developing robust and generalizable finetuning techniques will be crucial for unlocking their full potential across diverse real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Feature Protection For Out-of-distribution Generalization

Lu Tan, Huei Zhou, Yinxiang Huang, Zeming Zheng, Yujiu Yang

With the availability of large pre-trained models, a modern workflow for building real-world machine learning solutions is to fine-tune such models on a downstream task with a relatively small domain-specific dataset. In such applications, one major challenge is that the small fine-tuning dataset does not have sufficient coverage of the distribution encountered when the model is deployed. It is thus important to design fine-tuning methods that are robust to out-of-distribution (OOD) data that are under-represented by the training data. This paper compares common fine-tuning methods to investigate their OOD performance and demonstrates that standard methods will result in a significant change to the pre-trained model so that the fine-tuned features overfit the fine-tuning dataset. However, this causes deteriorated OOD performance. To overcome this issue, we show that protecting pre-trained features leads to a fine-tuned model more robust to OOD generalization. We validate the feature protection methods with extensive experiments of fine-tuning CLIP on ImageNet and DomainNet.

5/28/2024

cs.LG

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features will provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism to regularize our feature generation model during joint optimization, i.e., adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings. Code: https://github.com/apple/ml-ogen.

4/17/2024

cs.CV cs.AI

Fine-tuning can cripple your foundation model; preserving features may be the solution

Jishnu Mukhoti, Yarin Gal, Philip H. S. Torr, Puneet K. Dokania

Pre-trained foundation models, due to their enormous capacity and exposure to vast amounts of data during pre-training, are known to have learned plenty of real-world concepts. An important step in making these pre-trained models effective on downstream tasks is to fine-tune them on related datasets. While various fine-tuning methods have been devised and have been shown to be highly effective, we observe that a fine-tuned model's ability to recognize concepts on tasks $\textit{different}$ from the downstream one is reduced significantly compared to its pre-trained counterpart. This is an undesirable effect of fine-tuning as a substantial amount of resources was used to learn these pre-trained concepts in the first place. We call this phenomenon "concept forgetting" and via experiments show that most end-to-end fine-tuning approaches suffer heavily from this side effect. To this end, we propose a simple fix to this problem by designing a new fine-tuning method called $\textit{LDIFS}$ (short for $\ell_2$ distance in feature space) that, while learning new concepts related to the downstream task, allows a model to preserve its pre-trained knowledge as well. Through extensive experiments on 10 fine-tuning tasks we show that $\textit{LDIFS}$ significantly reduces concept forgetting. Additionally, we show that LDIFS is highly effective in performing continual fine-tuning on a sequence of tasks as well, in comparison with both fine-tuning as well as continual learning baselines.

7/2/2024

cs.LG cs.CV

An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration

Hiroki Naganuma, Ryuichiro Hataya, Ioannis Mitliagkas

In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy. Different from most prior work that has focused on advancing learning algorithms, we systematically examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration on downstream tasks. We evaluated 100 models across diverse pre-trained model sizes, five pre-training datasets, and five data augmentations through extensive experiments on four distribution shift datasets totaling over 120,000 GPU hours. Our results demonstrate the significant impact of pre-trained model selection, with optimal choices substantially improving OOD accuracy over algorithm improvement alone. We find larger models and bigger pre-training data improve OOD performance and calibration, in contrast to some prior studies that found modern deep networks to calibrate worse than classical shallow models. Our work underscores the overlooked importance of pre-trained model selection for out-of-distribution generalization and calibration.

6/3/2024

cs.LG cs.AI