Anchor-based Robust Finetuning of Vision-Language Models

arXiv:2404.06244

Published 4/10/2024 by Jinwei Han, Zhiwen Lin, Zhongyisun Sun, Yingguo Gao, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia

cs.CV

Abstract

We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift, such as natural to sketch images, and ii) zero-shot capability to recognize categories that were not contained in the finetuning data. Arguably, the diminished OOD generalization after finetuning stems from the excessively simplified finetuning target, which only provides the class information, such as "a photo of a [CLASS]". This is distinct from the process by which CLIP was pretrained, where there is abundant text supervision with rich semantic information. Therefore, we propose to compensate for the finetuning process using auxiliary supervision with rich semantic information, which acts as anchors to preserve the OOD generalization. Specifically, two types of anchors are elaborated in our method: i) a text-compensated anchor, which uses the images from the finetuning set but enriches the text supervision with captions from a pretrained captioner, and ii) an image-text-pair anchor, which is retrieved from a dataset similar to the pretraining data of CLIP according to the downstream task and is associated with the original CLIP text with rich semantics. Those anchors are utilized as auxiliary semantic information to maintain the original feature space of CLIP, thereby preserving the OOD generalization capabilities. Comprehensive experiments demonstrate that our method achieves in-distribution performance akin to conventional finetuning while attaining new state-of-the-art results on domain shift and zero-shot learning benchmarks.

Overview

This paper proposes "anchor-based robust finetuning", a way to finetune a vision-language model such as CLIP without degrading its out-of-distribution (OOD) generalization.
The key idea is to add semantically rich "anchor" supervision during finetuning so that the model stays close to CLIP's original feature space instead of collapsing onto class-name prompts.
According to the abstract, the method matches conventional finetuning on in-distribution accuracy and reports state-of-the-art results on domain shift and zero-shot learning benchmarks.

Plain English Explanation

The paper introduces a new technique called "anchor-based robust finetuning" to improve the performance and reliability of vision-language models. These models are trained to understand the relationship between visual information (like images) and language (like text).

The key idea is to use "anchor" examples during the finetuning process. Anchor examples are carefully selected images and text that serve as stable reference points for the model. By aligning the model's representations to these anchors, it becomes more robust and generalizable, meaning it can perform well on a wider range of tasks and data, not just the specific examples it was trained on.

The authors report that this anchor-based approach keeps in-distribution accuracy on par with standard finetuning while doing better on two kinds of out-of-distribution tests: domain shift (for example, a model finetuned on natural photos being asked to recognize sketches) and zero-shot classification (recognizing categories that never appeared in the finetuning data).

Technical Explanation

The paper introduces a novel finetuning approach called "anchor-based robust finetuning" for vision-language models. The key idea is to leverage carefully selected "anchor" examples during the finetuning process to guide the model towards more stable and generalizable representations.
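The abstract describes the image-text-pair anchor as being retrieved from a dataset similar to CLIP's pretraining data according to the downstream task. A minimal sketch of such a retrieval step is shown below, assuming nearest-neighbour search by cosine similarity over precomputed embeddings. The function names, the choice of query embedding, and the toy 2-D vectors are all hypothetical; the summary does not specify how the authors actually perform retrieval.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def retrieve_anchors(query_embeddings, candidate_pairs, k=1):
    """For each downstream query embedding, return the captions of the k
    most similar candidates.

    candidate_pairs is a list of (image_embedding, caption) tuples from a
    hypothetical image-text corpus (an assumption for this sketch).
    """
    results = []
    for query in query_embeddings:
        ranked = sorted(candidate_pairs,
                        key=lambda pair: cosine(query, pair[0]),
                        reverse=True)
        results.append([caption for _, caption in ranked[:k]])
    return results

# Toy 2-D embeddings; real ones would come from a CLIP encoder.
queries = [[1.0, 0.0], [0.0, 1.0]]
corpus = [
    ([0.9, 0.1], "a golden retriever running on a beach"),
    ([0.1, 0.9], "a pencil sketch of a sailing boat"),
    ([0.7, 0.7], "a dog sitting inside a boat"),
]
anchors = retrieve_anchors(queries, corpus, k=1)
```

At realistic corpus sizes an approximate nearest-neighbour index would likely replace the full sort used here; the sort is kept only to make the ranking logic easy to read.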

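Specifically, the abstract describes two types of anchors. The text-compensated anchor keeps the images from the finetuning set but replaces the bare class prompt with a richer caption produced by a pretrained captioner. The image-text-pair anchor is retrieved, according to the downstream task, from a dataset similar to CLIP's pretraining data and comes with its original, semantically rich text. Both serve as auxiliary supervision that keeps the finetuned model close to CLIP's original feature space, which is what the authors credit for preserving OOD generalization.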
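One way to picture an anchor-based objective is as a standard CLIP-style contrastive loss on the finetuning data plus auxiliary contrastive terms on the two anchor types named in the abstract. The sketch below is an illustration under stated assumptions, not the authors' loss: the additive form, the weights `w_caption` and `w_retrieved`, and the batch keys are hypothetical, and it assumes every image in a batch belongs to a distinct class so that image i matches text i. The temperature default of 0.07 follows CLIP's initial value.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss where image i is matched with text i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in texts] for img in images]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

def anchored_loss(batch, w_caption=1.0, w_retrieved=1.0):
    """Finetuning loss plus two auxiliary anchor terms (assumed additive)."""
    base = contrastive_loss(batch["images"], batch["class_prompts"])
    caption = contrastive_loss(batch["images"], batch["captions"])
    retrieved = contrastive_loss(batch["retrieved_images"],
                                 batch["retrieved_texts"])
    return base + w_caption * caption + w_retrieved * retrieved

# Toy batch of perfectly aligned 2-D embeddings.
eye = [[1.0, 0.0], [0.0, 1.0]]
toy_batch = {
    "images": eye, "class_prompts": eye, "captions": eye,
    "retrieved_images": eye, "retrieved_texts": eye,
}
total = anchored_loss(toy_batch)
```

Setting both weights to zero recovers plain contrastive finetuning on class prompts, which makes the contribution of each anchor term easy to ablate in this toy setup.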

The authors evaluate the method on domain shift and zero-shot learning benchmarks. They report in-distribution performance comparable to conventional finetuning, together with new state-of-the-art results on both kinds of OOD evaluation.
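For readers unfamiliar with zero-shot classification, the idea is that a class is predicted by comparing an image embedding against text embeddings of prompts built from class names, so no training images of that class are needed. The sketch below uses the "a photo of a [CLASS]" template quoted in the abstract. The `embed_text` callable and the toy lookup table are hypothetical stand-ins for a real CLIP text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def zero_shot_predict(image_emb, class_names, embed_text):
    """Return the class whose prompt embedding is closest to the image."""
    scores = {}
    for name in class_names:
        prompt = f"a photo of a {name}"
        scores[name] = cosine(image_emb, embed_text(prompt))
    return max(scores, key=scores.get)

# Toy stand-in for a text encoder: a fixed lookup table (assumption).
toy_text_embeddings = {
    "a photo of a cat": [1.0, 0.0],
    "a photo of a zebra": [0.0, 1.0],
}
prediction = zero_shot_predict([0.2, 0.9], ["cat", "zebra"],
                               toy_text_embeddings.__getitem__)
```

Because the class list is supplied at prediction time, the same function handles categories that never appeared in the finetuning data, provided the text encoder can embed their names.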

Critical Analysis

The paper presents a promising approach for improving the robustness and generalization of vision-language models through anchor-based finetuning. The authors provide a thorough experimental evaluation, demonstrating the effectiveness of their method on several challenging benchmarks.

However, the paper does not address some potential limitations and areas for further research. For example, the authors do not investigate the sensitivity of their method to the selection of anchor examples or the impact of different anchor design strategies. Additionally, it would be valuable to understand the computational overhead and training time requirements of the anchor-based finetuning approach compared to standard finetuning.

Further research could also explore applying anchor-based finetuning to other vision-language tasks, such as language-guided domain-generalized segmentation or work on AI-generated content, to assess its broader applicability and potential limitations.

Conclusion

This paper presents an anchor-based approach for robust finetuning of vision-language models. By supplementing class-name supervision with semantically rich anchors, the authors report in-distribution performance comparable to conventional finetuning and improved results on domain shift and zero-shot learning benchmarks.

The anchor-based finetuning technique offers a promising direction for enhancing the reliability and generalization of vision-language models, with potential applications across a wide range of real-world tasks. Further research is needed to fully understand the limitations and broader implications of this approach, but the findings presented in this paper are an important step forward in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features will provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism to regularize our feature generation model during joint optimization, i.e., adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings. Code: https://github.com/apple/ml-ogen.

4/17/2024

cs.CV cs.AI

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, Kyungwoo Song

Improving out-of-distribution (OOD) generalization through in-distribution (ID) adaptation is a primary goal of robust fine-tuning methods beyond the naive fine-tuning approach. However, despite decent OOD generalization performance from recent robust fine-tuning methods, OOD confidence calibration for reliable machine learning has not been fully addressed. This work proposes a robust fine-tuning method that improves both OOD accuracy and calibration error in Vision Language Models (VLMs). Firstly, we show that both types of errors have a shared upper bound consisting of two terms of ID data: 1) calibration error and 2) the smallest singular value of the input covariance matrix. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value, which is further aided by the self-distillation of a moving averaged model to achieve well-calibrated prediction. Starting from an empirical validation of our theoretical statements, we provide extensive experimental results on ImageNet distribution shift benchmarks that demonstrate the effectiveness of our method.

5/28/2024

cs.CV cs.AI

Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

Jiang-Xin Shi, Chi Zhang, Tong Wei, Yu-Feng Li

Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed data distributions and might not have abundant samples for all the classes; 2) There might be emerging tasks with new classes that contain no samples at all. To overcome them, we propose a novel framework to achieve efficient and long-tailed generalization, which can be termed as Candle. During the training process, we propose compensating logit-adjusted loss to encourage large margins of prototypes and alleviate imbalance both within the base classes and between the base and new classes. For efficient adaptation, we treat the CLIP model as a black box and leverage the extracted features to obtain visual and textual prototypes for prediction. To make full use of multi-modal information, we also propose cross-modal attention to enrich the features from both modalities. For effective generalization, we introduce virtual prototypes for new classes to make up for their lack of training images. Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets while substantially reducing the training time, demonstrating the superiority of our approach. The source code is available at https://github.com/shijxcs/Candle.

6/19/2024

cs.CV cs.LG

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the down-stream LVLMs is required. The code and robust models are available at https://github.com/chs20/RobustVLM

6/6/2024

cs.LG cs.AI cs.CV stat.ML