Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

arXiv:2401.15914

Published 4/17/2024 by Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

Abstract

Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features will provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism to regularize our feature generation model during joint optimization, i.e., adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings. Code: https://github.com/apple/ml-ogen.

Overview

Current vision-language models generalize well across many visual domains and tasks, but they mostly perform zero-shot recognition in a closed-set manner, so they struggle with open-domain visual concepts by design.
Recent finetuning approaches like prompt learning have shown some improvement in both in-distribution (ID) and out-of-distribution (OOD) accuracy.
However, the paper finds that after long enough finetuning without proper regularization, these models tend to overfit the known classes in the dataset, which degrades performance on unknown classes.

Plain English Explanation

Vision-language models, which understand both images and text, have become very capable at recognizing a wide variety of visual concepts. However, they usually classify in a closed-set way: an image is matched against a fixed list of candidate class names, so the answer is always one of those names. When shown a concept outside that list, they struggle.
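
To make the closed-set limitation concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the two-dimensional feature vectors, the `zero_shot_predict` helper, and the class names are all made-up assumptions, standing in for the image and text embeddings a real vision-language model would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(image_feature, class_text_features):
    """Return the candidate class name whose text feature best matches the image."""
    return max(class_text_features,
               key=lambda name: cosine(image_feature, class_text_features[name]))

# Toy text features for the only two candidate classes (assumed values).
candidates = {"cat": [1.0, 0.0], "car": [0.0, 1.0]}

# Toy feature of an image showing a fox, a class that is NOT a candidate.
fox_image = [0.8, 0.3]

# The classifier has no way to say "none of these": it picks the closest name.
prediction = zero_shot_predict(fox_image, candidates)
print(prediction)
```

Because the prediction is an argmax over the candidate list, an image of an unlisted class is always forced onto one of the listed names.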

Recent techniques like prompt learning have helped these models do somewhat better on both the classes seen during finetuning ("in-distribution") and new, unseen classes ("out-of-distribution"). But the paper finds that if finetuning runs long enough on only the known classes without proper regularization, the model overfits to them and actually performs worse on the unknown classes.

Technical Explanation

The key idea in this paper is to introduce a class-conditional feature generator that can synthesize features for unknown classes, and use these to help the model learn a better decision boundary between known and unknown concepts.

Specifically:

The feature generator takes just the class name as input and generates representative features for that class.
These synthetic features are then used, along with the real features from the known classes, to train the model.
This helps the model learn to better distinguish known from unknown classes, improving its "out-of-distribution generalization".
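
The three steps above can be sketched roughly as follows. This is an illustrative toy in plain Python, not the architecture from the paper or its released code: the `synthesize_feature` function, the nearest-known-class mixing rule, the temperature value, and all feature vectors are assumptions chosen only to show the shape of the idea (class name in, synthetic feature out, joint training on real plus synthetic features).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def synthesize_feature(unknown_text, known_text, known_image, k=2, temperature=0.1):
    """Hypothetical generator: mix the image prototypes of the k known classes
    whose name embeddings are closest to the unknown class name embedding."""
    q = normalize(unknown_text)
    sims = {name: dot(q, normalize(t)) for name, t in known_text.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    weights = softmax([sims[n] / temperature for n in nearest])
    dim = len(next(iter(known_image.values())))
    mixed = [sum(w * known_image[n][i] for w, n in zip(weights, nearest))
             for i in range(dim)]
    return normalize(mixed), nearest

def cross_entropy(feature, label, class_text, temperature=0.1):
    """Softmax cross-entropy of one feature against all class name embeddings."""
    names = list(class_text)
    f = normalize(feature)
    logits = [dot(f, normalize(class_text[n])) / temperature for n in names]
    return -math.log(softmax(logits)[names.index(label)])

# Toy embeddings (assumed values): text features of class names, image prototypes.
known_text = {"cat": [1.0, 0.0, 0.0], "dog": [0.8, 0.6, 0.0], "car": [0.0, 0.0, 1.0]}
known_image = {"cat": [0.9, 0.1, 0.0], "dog": [0.7, 0.7, 0.1], "car": [0.0, 0.1, 1.0]}
fox_text = [0.9, 0.4, 0.1]  # the unknown class is described only by its name

# Step 1: synthesize a feature for the unknown class from its name embedding.
fox_feature, nearest = synthesize_feature(fox_text, known_text, known_image)

# Steps 2-3: train on real and synthetic features against known + unknown classes.
all_text = dict(known_text, fox=fox_text)
batch = [(known_image["cat"], "cat"), (known_image["car"], "car"), (fox_feature, "fox")]
joint_loss = sum(cross_entropy(f, y, all_text) for f, y in batch) / len(batch)
print(nearest, round(joint_loss, 4))
```

In a real setup the loss would be minimized with respect to learnable prompts and the generator's parameters; the sketch only evaluates it once to show how synthetic features enter the same objective as real ones.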

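The paper also introduces an adaptive self-distillation mechanism that regularizes the feature generation model during joint optimization. It does so by adaptively transferring knowledge between model states, which further guards against overfitting.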
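
One way to picture self-distillation between model states is a teacher built from recent snapshots of the student. The sketch below is a simplified assumption, not the paper's exact mechanism: the `LocalMeanTeacher` class, the window sizes, the `resize` schedule hook, and the use of raw parameter vectors in place of model outputs are all hypothetical choices made for brevity.

```python
from collections import deque

class LocalMeanTeacher:
    """Hypothetical teacher: the average of the last `window` student snapshots."""

    def __init__(self, window):
        self.snapshots = deque(maxlen=window)

    def update(self, params):
        # Store a copy of the current student state; old states fall out of the window.
        self.snapshots.append(list(params))

    def resize(self, window):
        # An adaptive schedule (assumed here) could shrink or grow the window over time.
        self.snapshots = deque(self.snapshots, maxlen=window)

    def params(self):
        n = len(self.snapshots)
        return [sum(column) / n for column in zip(*self.snapshots)]

def distillation_loss(student_out, teacher_out):
    """Mean squared difference pulling the student toward the teacher."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

# Simulated student states over four training steps (toy values).
teacher = LocalMeanTeacher(window=3)
for state in ([1.0, 0.0], [2.0, 0.0], [3.0, 3.0], [4.0, 6.0]):
    teacher.update(state)

student = [4.0, 6.0]
teacher_state = teacher.params()          # mean of the last three states
loss = distillation_loss(student, teacher_state)

teacher.resize(2)                         # keep only the two most recent states
narrow_state = teacher.params()
print(teacher_state, loss, narrow_state)
```

The distillation term would be added to the main training loss with some weight, so that the current model cannot drift arbitrarily far from an average of its own recent states.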

Critical Analysis

The paper makes a valid point that existing finetuning approaches for vision-language models can lead to overfitting on the known classes, hurting performance on unknown classes. The proposed OGEN method seems like a reasonable approach to address this, leveraging synthetic features to improve out-of-distribution generalization.

However, the effectiveness of this approach likely depends on the quality of the synthetic features generated. If the feature generator struggles to produce representative features for unknown classes, it may not provide much benefit. There could also be computational and memory overhead from running the feature generator.

Additionally, the paper does not delve into other potential approaches for improving out-of-distribution robustness, such as learning to detect OOD samples or using perturbation-based techniques. Exploring a wider range of methods could lead to more insights.

Conclusion

This paper identifies an important limitation of finetuned vision-language models: their tendency to overfit to known classes, which hurts performance on unknown concepts. The proposed OGEN approach attempts to address this by generating synthetic features for unknown classes and using them to regularize the model.

While the core idea seems promising, its effectiveness likely depends on the quality of the synthetic features. Exploring a broader range of techniques for improving out-of-distribution robustness, such as those in the related papers listed below, could lead to further advances in making these models more versatile and reliable.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, Kyungwoo Song

Improving out-of-distribution (OOD) generalization through in-distribution (ID) adaptation is a primary goal of robust fine-tuning methods beyond the naive fine-tuning approach. However, despite decent OOD generalization performance from recent robust fine-tuning methods, OOD confidence calibration for reliable machine learning has not been fully addressed. This work proposes a robust fine-tuning method that improves both OOD accuracy and calibration error in Vision Language Models (VLMs). Firstly, we show that both types of errors have a shared upper bound consisting of two terms of ID data: 1) calibration error and 2) the smallest singular value of the input covariance matrix. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value, which is further aided by the self-distillation of a moving averaged model to achieve well-calibrated prediction. Starting from an empirical validation of our theoretical statements, we provide extensive experimental results on ImageNet distribution shift benchmarks that demonstrate the effectiveness of our method.

5/28/2024

cs.CV cs.AI

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

Lin Zhu, Yifeng Yang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye

Recent vision-language pre-trained models (VL-PTMs) have shown remarkable success in open-vocabulary tasks. However, downstream use cases often involve further fine-tuning of VL-PTMs, which may distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of enhancing out-of-distribution (OOD) generalization on covariate shifts and simultaneously detecting semantic-shifted unseen classes. Thus a critical but underexplored question arises: How to improve VL-PTMs' generalization ability to closed-set OOD data, while effectively detecting open-set unseen classes during fine-tuning? In this paper, we propose a novel objective function of OOD detection that also serves to improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on training data leads to domain-consistent Hessians of classification loss, a strong indicator for OOD generalization revealed by theoretical analysis. Based on this finding, we have developed a unified fine-tuning framework that allows for concurrent optimization of both tasks. Extensive experiments have demonstrated the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.

5/28/2024

cs.CV

Feature Protection For Out-of-distribution Generalization

Lu Tan, Huei Zhou, Yinxiang Huang, Zeming Zheng, Yujiu Yang

With the availability of large pre-trained models, a modern workflow for building real-world machine learning solutions is to fine-tune such models on a downstream task with a relatively small domain-specific dataset. In such applications, one major challenge is that the small fine-tuning dataset does not have sufficient coverage of the distribution encountered when the model is deployed. It is thus important to design fine-tuning methods that are robust to out-of-distribution (OOD) data that are under-represented by the training data. This paper compares common fine-tuning methods to investigate their OOD performance and demonstrates that standard methods will result in a significant change to the pre-trained model so that the fine-tuned features overfit the fine-tuning dataset. However, this causes deteriorated OOD performance. To overcome this issue, we show that protecting pre-trained features leads to a fine-tuned model more robust to OOD generalization. We validate the feature protection methods with extensive experiments of fine-tuning CLIP on ImageNet and DomainNet.

5/28/2024

cs.LG

Anchor-based Robust Finetuning of Vision-Language Models

Jinwei Han, Zhiwen Lin, Zhongyisun Sun, Yingguo Gao, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia

We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift such as natural to sketch images, and ii) zero-shot capability to recognize the category that was not contained in the finetune data. Arguably, the diminished OOD generalization after finetuning stems from the excessively simplified finetuning target, which only provides the class information, such as "a photo of a [CLASS]". This is distinct from the process in that CLIP was pretrained, where there is abundant text supervision with rich semantic information. Therefore, we propose to compensate for the finetune process using auxiliary supervision with rich semantic information, which acts as anchors to preserve the OOD generalization. Specifically, two types of anchors are elaborated in our method, including i) text-compensated anchor which uses the images from the finetune set but enriches the text supervision from a pretrained captioner, ii) image-text-pair anchor which is retrieved from the dataset similar to pretraining data of CLIP according to the downstream task, associating with the original CLIP text with rich semantics. Those anchors are utilized as auxiliary semantic information to maintain the original feature space of CLIP, thereby preserving the OOD generalization capabilities. Comprehensive experiments demonstrate that our method achieves in-distribution performance akin to conventional finetuning while attaining new state-of-the-art results on domain shift and zero-shot learning benchmarks.

4/10/2024

cs.CV