Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

2402.04050

Published 6/4/2024 by Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

Abstract

With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model's parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.

Overview

This paper proposes a new approach called CraFT (Collaborative Fine-Tuning) for fine-tuning pretrained vision-language models (VLMs) when their internal parameters are not accessible.
CraFT comprises two modules: a prompt generation module for learning effective text prompts, and a prediction refinement module for enhancing the model's output predictions.
The paper also introduces an auxiliary loss to promote consistent optimization across these modules, and a collaborative training algorithm to jointly optimize them.
Experiments on few-shot classification over 15 datasets show that CraFT improves accuracy by about 12% in the 16-shot setting, while training faster and needing only about 1/80 of the deployment memory of the white-box method.

Plain English Explanation

Large vision-language models such as CLIP have shown impressive capabilities, but they are often offered as "black boxes": you can send inputs and read outputs, but you cannot access the internal parameters. This makes it challenging to fine-tune these models for specific tasks.

The researchers developed a new approach called CraFT that allows you to fine-tune these black-box VLMs without access to their internal parameters. CraFT has two key components:

A prompt generation module that learns effective text prompts to feed into the VLM. This helps the model produce better outputs for the target task.
A prediction refinement module that takes the VLM's initial predictions and enhances them, again to improve performance on the target task.

These modules are trained collaboratively using a novel algorithm, along with an additional loss function to keep them aligned.

The researchers showed that CraFT can achieve significant performance gains on few-shot classification tasks. It also trains faster and needs far less memory at deployment than the white-box method it is compared against, at a small cost in accuracy. This makes CraFT a practical and efficient way to customize large VLMs for specific applications, even when you can't access their internal workings.

Technical Explanation

The key challenge addressed by this paper is fine-tuning large pretrained vision-language models (VLMs) when their internal parameters are not accessible (i.e., they are provided as "black boxes").

To tackle this, the authors propose the CraFT (Collaborative Fine-Tuning) approach, which comprises two key modules:

Prompt Generation Module: This module learns effective text prompts to feed into the black-box VLM, with the goal of producing better outputs for the target task.
Prediction Refinement Module: This module takes the VLM's initial predictions and refines them, again to improve performance on the target task.
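
The "residual style" refinement described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name ResidualRefiner, the linear form of the correction, and the scale alpha are all assumptions made for illustration. The only idea taken from the paper summary is that the refined prediction equals the black-box prediction plus a learned correction.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class ResidualRefiner:
    """Hypothetical refinement head (an assumption, not the paper's exact module).

    refined = logits + alpha * (logits @ W + b)
    """

    def __init__(self, num_classes, alpha=0.5):
        # Zero initialisation: before training, the head returns the
        # black-box predictions unchanged.
        self.W = np.zeros((num_classes, num_classes))
        self.b = np.zeros(num_classes)
        self.alpha = alpha

    def __call__(self, logits):
        return logits + self.alpha * (logits @ self.W + self.b)


# Logits as they might come back from a black-box VLM (made-up values).
vlm_logits = np.array([[2.0, 1.0, 0.1],
                       [0.3, 0.2, 0.9]])

refiner = ResidualRefiner(num_classes=3)
refined = refiner(vlm_logits)
probs = softmax(refined)
```

Because the correction is added to the original output rather than replacing it, an untrained head cannot make predictions worse than the black-box model alone, which is one common motivation for residual designs.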

The authors also introduce an auxiliary prediction-consistent loss to promote consistent optimization across these two modules. A novel collaborative training algorithm is used to jointly optimize the entire CraFT framework.
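
One way to picture collaborative training in a black-box setting is an alternating loop: the prompt is updated without gradients (only model outputs are available), and the refinement head is updated with ordinary gradient descent on the returned predictions. The sketch below is a toy under stated assumptions, not the paper's algorithm. ToyBlackBox, train, the random-search prompt update, the KL form of the consistency term, and every hyperparameter are hypothetical stand-ins; the summary above does not specify which optimizers or loss form the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, y):
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))


def kl(logits_p, logits_q):
    """Mean KL(softmax(logits_p) || softmax(logits_q)); assumed consistency term."""
    p, q = softmax(logits_p), softmax(logits_q)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))


class ToyBlackBox:
    """Stand-in for a black-box VLM: returns logits and counts queries."""

    def __init__(self, dim, num_classes):
        # Hidden weights; the training loop never reads them.
        self._proj = rng.normal(size=(dim, num_classes)) / np.sqrt(dim)
        self.queries = 0

    def __call__(self, prompt, images):
        self.queries += 1  # assumption: one batched call counts as one query
        return (images + prompt) @ self._proj


def train(model, images, labels, num_classes,
          rounds=20, pop=8, sigma=0.1, lr=0.1, alpha=0.5, lam=0.1):
    prompt = np.zeros(images.shape[1])
    W = np.zeros((num_classes, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]

    def refine(z):
        return z + alpha * (z @ W + b)

    def prompt_objective(z):
        # Task loss on raw predictions plus agreement with refined predictions.
        return cross_entropy(z, labels) + lam * kl(refine(z), z)

    z = model(prompt, images)
    best = prompt_objective(z)
    for _ in range(rounds):
        # Step 1: derivative-free prompt update (simple random search).
        for _ in range(pop):
            cand = prompt + sigma * rng.normal(size=prompt.shape)
            z_cand = model(cand, images)
            score = prompt_objective(z_cand)
            if score < best:
                prompt, z, best = cand, z_cand, score
        # Step 2: gradient step on the refinement head using cached logits,
        # so this step costs no additional queries.
        g = (softmax(refine(z)) - onehot) / len(labels)
        W -= lr * alpha * (z.T @ g)
        b -= lr * alpha * g.sum(axis=0)
        best = prompt_objective(z)  # the head changed, so refresh the score
    return prompt, W, b


images = rng.normal(size=(32, 8))
labels = rng.integers(0, 4, size=32)
model = ToyBlackBox(dim=8, num_classes=4)
prompt, W, b = train(model, images, labels, num_classes=4)
```

The query counter makes the budget explicit: only the prompt step calls the model, so the total number of queries is fixed by rounds and pop regardless of how many gradient steps the refinement head takes.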

Extensive experiments on few-shot classification tasks across 15 datasets show that CraFT outperforms previous black-box fine-tuning methods. Specifically, CraFT achieves a gain of about 12% on 16-shot datasets using only 8,000 queries. Furthermore, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method.

Critical Analysis

The CraFT approach presented in this paper is a clever solution to the challenge of fine-tuning black-box VLMs. By separating the fine-tuning process into prompt generation and prediction refinement, the authors are able to effectively customize the model's behavior without requiring access to its internal parameters.

One potential limitation of the approach is that it may not be as effective for tasks that require more substantial modifications to the VLM's underlying knowledge and capabilities. The prompt generation and prediction refinement modules are designed to work within the existing model, rather than fundamentally changing its behavior.

Additionally, the paper does not explore the transferability of the CraFT modules across different target tasks. It would be interesting to see how well the prompt generation and prediction refinement modules perform when applied to new tasks, rather than just fine-tuning on the original set of tasks.

Overall, the CraFT approach represents a significant advancement in the field of black-box model fine-tuning, and the authors' experimental results are quite impressive. This research could have important implications for making large, powerful VLMs more accessible and customizable for a wide range of applications.

Conclusion

This paper presents a novel approach called CraFT for fine-tuning pretrained vision-language models when their internal parameters are not accessible. CraFT comprises a prompt generation module and a prediction refinement module, which are jointly optimized using a collaborative training algorithm and an auxiliary loss function.

The key contribution of this work is demonstrating that effective fine-tuning of black-box VLMs is possible without requiring access to their internal parameters. This is a significant advancement, as model owners often choose to release their models as black boxes to protect their intellectual property.

The experimental results show that CraFT can achieve substantial performance gains on few-shot classification tasks, while training faster and using only about 1/80 of the deployment memory of the white-box method, at a cost of 1.62%. This makes CraFT a practical and efficient way to customize large, powerful VLMs for specific applications, even when their internal workings are not accessible.

Overall, this research has important implications for improving the accessibility and usability of state-of-the-art vision-language models in a wide range of real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Language Models as Black-Box Optimizers for Vision-Language Models

Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize the state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.

5/15/2024

cs.CL cs.CV cs.LG cs.MM

CoLLaVO: Crayon Large Language and Vision mOdel

Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

6/4/2024

cs.CV

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine

Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.

5/20/2024

cs.AI cs.CL cs.CV cs.LG

Supervised Fine-tuning in turn Improves Visual Foundation Models

Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and then tested on out-of-domain benchmarks. With updating using ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks including vision and vision-linguistic scenarios.

4/12/2024

cs.CV cs.AI