VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness

Read original: arXiv:2401.07853 - Published 4/16/2024 by Rongyu Zhang, Zefan Cai, Huanrui Yang, Zidong Liu, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Baobao Chang, Yuan Du, Li Du, Shanghang Zhang

Overview

A novel approach called Vision-language Collaborative Active Finetuning (VeCAF) is proposed to address the diminished training efficiency in the conventional finetuning process for pretrained vision models (PVMs).
VeCAF leverages the availability of labels and natural language annotations of images to perform parametric data selection for PVM finetuning, leading to faster convergence and better performance.
The approach utilizes the semantic richness of text embeddings to augment image features, and allows for handling of out-of-distribution scenarios without external data.

Plain English Explanation

When training pretrained vision models (PVMs) on new tasks, the common approach is to "finetune" the model by training it on a set of randomly selected data points. However, this conventional finetuning process can be inefficient, as the randomly selected data may not be the most effective for quickly training the model to perform well on the new task.

The researchers behind this paper propose a new approach called Vision-language Collaborative Active Finetuning (VeCAF). VeCAF makes use of the growing availability of labeled and annotated images, often collected through web-crawling or controlled generation, to select the most informative data points for finetuning the PVM. This "parametric data selection" process helps the model converge to the desired performance level more quickly.

VeCAF also leverages the rich semantic information encoded in text embeddings to enhance the image features used for finetuning. This allows the model to better handle situations where the test data is different from the training data (known as "out-of-distribution" scenarios) without needing additional external data.

Technical Explanation

The key innovation of VeCAF is the incorporation of the finetuning objective into the data selection process. By selecting the data points that are most informative for the target task, VeCAF can guide the PVM towards faster convergence to the desired performance level.
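The idea of letting the finetuning objective drive data selection can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that per-sample cross-entropy loss under the current model serves as the "informativeness" score, and the function names and toy values below are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Per-sample cross-entropy given predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))

def select_by_objective(pred_probs, labels, budget):
    """Return indices of the `budget` samples with the highest current loss.

    Assumption: loss under the current model stands in for how informative
    a sample is for the finetuning objective. VeCAF's actual rule may differ.
    """
    losses = [cross_entropy(p, y) for p, y in zip(pred_probs, labels)]
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:budget], losses

# Toy pool of 4 samples over 3 classes (hypothetical values).
pred_probs = [
    [0.90, 0.05, 0.05],  # confident and correct: low loss
    [0.20, 0.30, 0.50],  # wrong prediction: high loss
    [0.40, 0.35, 0.25],  # uncertain: moderate loss
    [0.05, 0.90, 0.05],  # confident and correct: low loss
]
labels = [0, 0, 1, 1]

selected, losses = select_by_objective(pred_probs, labels, budget=2)
print(selected)  # indices of the two highest-loss samples
```

Ranking by loss alone tends to favor outliers and noisy labels, which is one reason a practical method would likely combine an objective signal with some notion of coverage or diversity.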

The text embeddings used in VeCAF are leveraged to augment the image features, exploiting the inherent semantic richness of the text domain. This text-domain augmentation enhances the model's ability to handle out-of-distribution scenarios, as the text-based information can provide valuable context and cues that are not present in the images alone.
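A rough sketch of what augmenting an image feature with a text embedding could look like follows. It assumes the two vectors live in a shared embedding space of the same dimension and uses simple linear interpolation with a hypothetical mixing weight `alpha`; the summary does not specify the paper's exact augmentation operator, so treat this as one plausible reading rather than the authors' method.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def augment_with_text(image_feat, text_emb, alpha=0.3):
    """Blend a normalized image feature with a normalized text embedding.

    `alpha` is a hypothetical mixing weight: 0 keeps the image feature,
    1 replaces it with the text embedding.
    """
    img = l2_normalize(image_feat)
    txt = l2_normalize(text_emb)
    mixed = [(1 - alpha) * a + alpha * b for a, b in zip(img, txt)]
    return l2_normalize(mixed)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

# Toy 3-dimensional embeddings (hypothetical values).
image_feat = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]

aug = augment_with_text(image_feat, text_emb, alpha=0.3)
print(cosine(image_feat, text_emb), cosine(aug, text_emb))
```

The augmented feature moves toward the text embedding while staying on the unit sphere, which is the sense in which text can inject semantic cues absent from the image feature itself.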

Extensive experiments show that VeCAF outperforms baseline finetuning methods in both in-distribution and out-of-distribution image classification tasks. On the ImageNet dataset, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning, and achieves a 2.7% accuracy improvement over the state-of-the-art active finetuning method using the same number of batches.
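To make the "up to 3.3x fewer training batches" figure concrete, the arithmetic can be sketched as below. The baseline batch count is a made-up number chosen only for illustration, and 3.3x is the best case reported, not a guaranteed ratio.

```python
def batches_with_reduction(full_finetune_batches, reduction_factor=3.3):
    """Batches needed if training reaches the target `reduction_factor` times sooner."""
    return full_finetune_batches / reduction_factor

# Hypothetical baseline: full finetuning reaches the target after 3300 batches.
full_batches = 3300
vecaf_batches = batches_with_reduction(full_batches)
saved_fraction = 1 - vecaf_batches / full_batches

print(round(vecaf_batches), round(saved_fraction * 100, 1))
```

Under that assumed baseline, the best-case reduction corresponds to roughly 70% fewer batches to reach the same target performance.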

Critical Analysis

The paper presents a compelling approach to improving the efficiency and performance of finetuning pretrained vision models. However, some potential limitations and areas for further research are worth considering:

The paper focuses on image classification tasks, but it would be interesting to see how VeCAF performs on other downstream vision tasks, such as object detection or segmentation.
The reliance on external language annotations may not be feasible in all scenarios, especially for specialized or niche domains. Investigating ways to leverage internal dataset information or self-supervised pretraining could further improve the accessibility and applicability of the approach.
While the text-domain augmentation enhances out-of-distribution performance, the paper does not explore the limits of this capability. Evaluating the approach on more challenging or diverse out-of-distribution scenarios could provide deeper insights.

Overall, the VeCAF approach represents a promising step forward in improving the efficiency and effectiveness of finetuning pretrained vision models, particularly in leveraging the rich information available in text annotations.

Conclusion

The Vision-language Collaborative Active Finetuning (VeCAF) approach proposed in this paper offers a novel solution to the diminished training efficiency problem in the conventional finetuning process for pretrained vision models. By utilizing the availability of labeled and annotated image data, VeCAF can select the most informative data points for finetuning, leading to faster convergence and better performance. The incorporation of text-domain augmentation further enhances the model's ability to handle out-of-distribution scenarios, making VeCAF a versatile and compelling approach for improving the downstream application of pretrained vision models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness

Rongyu Zhang, Zefan Cai, Huanrui Yang, Zidong Liu, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Baobao Chang, Yuan Du, Li Du, Shanghang Zhang

Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. However, the conventional finetuning process with randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). With the emerging availability of labels and natural language annotations of images through web-scale crawling or controlled generation, VeCAF makes use of these information to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence to meet the performance goal. This process is assisted by the inherent semantic richness of the text embedding space which we use to augment image features. Furthermore, the flexibility of text-domain augmentation allows VeCAF to handle out-of-distribution scenarios without external data. Extensive experiments show the leading performance and high computational efficiency of VeCAF that is superior to baselines in both in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning, and achieves an accuracy improvement of 2.7% over the state-of-the-art active finetuning method with the same number of batches.

4/16/2024

Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model's parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT .

6/4/2024

Supervised Fine-tuning in turn Improves Visual Foundation Models

Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and then tested on out-of-domain benchmarks. With updating using ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks including vision and vision-linguistic scenarios.

4/12/2024

Probing the Efficacy of Federated Parameter-Efficient Fine-Tuning of Vision Transformers for Medical Image Classification

Naif Alkhunaizi, Faris Almalik, Rouqaiah Al-Refai, Muzammal Naseer, Karthik Nandakumar

With the advent of large pre-trained transformer models, fine-tuning these models for various downstream tasks is a critical problem. Paucity of training data, the existence of data silos, and stringent privacy constraints exacerbate this fine-tuning problem in the medical imaging domain, creating a strong need for algorithms that enable collaborative fine-tuning of pre-trained models. Moreover, the large size of these models necessitates the use of parameter-efficient fine-tuning (PEFT) to reduce the communication burden in federated learning. In this work, we systematically investigate various federated PEFT strategies for adapting a Vision Transformer (ViT) model (pre-trained on a large natural image dataset) for medical image classification. Apart from evaluating known PEFT techniques, we introduce new federated variants of PEFT algorithms such as visual prompt tuning (VPT), low-rank decomposition of visual prompts, stochastic block attention fine-tuning, and hybrid PEFT methods like low-rank adaptation (LoRA)+VPT. Moreover, we perform a thorough empirical analysis to identify the optimal PEFT method for the federated setting and understand the impact of data distribution on federated PEFT, especially for out-of-domain (OOD) and non-IID data. The key insight of this study is that while most federated PEFT methods work well for in-domain transfer, there is a substantial accuracy vs. efficiency trade-off when dealing with OOD and non-IID scenarios, which is commonly the case in medical imaging. Specifically, every order of magnitude reduction in fine-tuned/exchanged parameters can lead to a 4% drop in accuracy. Thus, the initial model choice is crucial for federated PEFT. It is preferable to use medical foundation models learned from in-domain medical image data (if available) rather than general vision models.

7/17/2024