Progressive Visual Prompt Learning with Contrastive Feature Re-formation

Read original: arXiv:2304.08386 - Published 7/2/2024 by Chen Xu, Yuhan Zhu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang

Overview

This paper proposes a new method called "Progressive Visual Prompt" (ProVP) for adapting Vision-Language (V-L) models to downstream tasks.
Previous work has focused mainly on text prompts, while visual prompt methods for V-L models remain few and suffer from either mediocre performance or unstable training.
The authors introduce the ProVP structure to improve interactions between prompts at different layers and effectively propagate image embeddings to deeper layers.
They also propose a "contrastive feature re-formation" technique to prevent the prompted visual features from deviating too much from the fixed CLIP visual feature distribution, improving generalization.
The ProVP-Ref method (combining ProVP and contrastive feature re-formation) achieves state-of-the-art results on 7 out of 11 image benchmark datasets, outperforming previous prompt-based methods.

Plain English Explanation

Vision-Language (V-L) models are AI systems that can understand and process both images and text. Prompt learning has been proposed as an alternative to fine-tuning these models for specific tasks, as it is often more efficient.

Previous work has focused on using text-based prompts to adapt V-L models. However, the authors of this paper argue that visual prompts could also be very useful, as they could help the model better understand the visual information in the task. Unfortunately, existing visual prompt methods have either performed poorly or been unstable during training.

To address this, the researchers developed a new "Progressive Visual Prompt" (ProVP) structure. ProVP strengthens the interactions among the prompts inserted at different layers of the model, so that image information is carried more effectively into the deeper layers. The authors also report that their method behaves partially like an "instance adaptive prompt" approach, which tailors the prompt to each individual input.

Additionally, the researchers introduced a "contrastive feature re-formation" technique. This helps ensure that the visual features extracted by the prompted model don't deviate too much from the original visual features the model was trained on. This improves the model's ability to generalize to new situations.

By combining the ProVP structure and the contrastive feature re-formation, the authors' ProVP-Ref method was able to outperform previous prompt-based approaches on a variety of image-based benchmark tasks. This suggests that visual prompts can be a powerful way to adapt V-L models, if done correctly.

Technical Explanation

The key technical contributions of this paper are the ProVP structure and the contrastive feature re-formation technique.

ProVP Structure:

The ProVP structure consists of a series of prompts that are applied at different layers of the V-L model.
These prompts are designed to strengthen the interactions among the prompts of different layers.
Specifically, the prompts help propagate the image embeddings to deeper layers of the model, allowing the visual information to be better integrated.
The authors report that ProVP behaves partially like an "instance adaptive prompt" method, where the prompt is tailored to each individual input.
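
The idea of linking prompts across layers can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary gives no formula, so the mixing rule `(1 - alpha) * new_prompt + alpha * previous_prompt_output`, the weight `alpha`, and the stand-in `toy_layer` are all assumptions made for illustration. The point is only that once a layer's prompt input includes the previous layer's prompt output, the deeper prompts depend on the image, which is the "instance adaptive" behaviour described above.

```python
import numpy as np

def toy_layer(tokens, weight):
    """Hypothetical stand-in for one frozen encoder block: each token sees the mean of all tokens."""
    mixed = tokens + tokens.mean(axis=0, keepdims=True)
    return np.tanh(mixed @ weight)

def progressive_forward(image_tokens, prompts, weights, alpha=0.5):
    """Pass image tokens and per-layer prompts through the toy layers.

    alpha is an assumed mixing weight: 0 replaces the prompts at every layer,
    larger values carry more of the previous layer's prompt output forward.
    Returns the final image tokens and the prompt input used at each layer.
    """
    n_prompt = prompts[0].shape[0]
    x, prompt_out, prompt_inputs = image_tokens, None, []
    for depth, weight in enumerate(weights):
        prompt_in = prompts[depth]
        if depth > 0:
            # Progressive connection (assumed form): blend the new learnable
            # prompt with what the previous layer produced at the prompt positions.
            prompt_in = (1.0 - alpha) * prompts[depth] + alpha * prompt_out
        prompt_inputs.append(prompt_in)
        out = toy_layer(np.concatenate([x, prompt_in], axis=0), weight)
        x, prompt_out = out[:-n_prompt], out[-n_prompt:]
    return x, prompt_inputs

rng = np.random.default_rng(0)
dim, n_layers, n_prompt = 8, 3, 2
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]
prompts = [rng.normal(size=(n_prompt, dim)) for _ in range(n_layers)]
image_a = rng.normal(size=(5, dim))
image_b = rng.normal(size=(5, dim))

_, inputs_a = progressive_forward(image_a, prompts, weights, alpha=0.5)
_, inputs_b = progressive_forward(image_b, prompts, weights, alpha=0.5)
# Compare the last layer's prompt input for two different images.
print(np.allclose(inputs_a[-1], inputs_b[-1]))
```

Setting `alpha=0.0` in this sketch makes every layer use only its own learnable prompt, so the prompt inputs no longer vary with the image.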

Contrastive Feature Re-formation:

To prevent the prompted visual features from deviating too much from the original CLIP visual feature distribution, the authors introduce a contrastive feature re-formation technique.
This involves adding a contrastive loss term that encourages the prompted visual features to remain close to the original CLIP visual features.
This helps maintain the model's ability to generalize, as the visual features stay within the distribution the model was trained on.
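
A contrastive term of this kind can be sketched as follows. The summary does not give the exact loss, so this is an assumed InfoNCE-style form rather than the paper's formulation: each prompted feature is pulled toward the frozen CLIP feature of the same image and pushed away from the frozen features of the other images in the batch. The function name `reformation_loss` and the `temperature` value are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def reformation_loss(prompted, frozen, temperature=0.07):
    """Assumed InfoNCE-style penalty on drift away from frozen features.

    prompted: (batch, dim) features from the prompted encoder.
    frozen:   (batch, dim) features from the fixed encoder for the same images.
    """
    p = l2_normalize(prompted)
    f = l2_normalize(frozen)
    logits = p @ f.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i is column i (same image, frozen encoder).
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
frozen = rng.normal(size=(4, 16))
aligned = reformation_loss(frozen, frozen)                     # no drift
drifted = reformation_loss(np.roll(frozen, 1, axis=0), frozen)  # features mismatched
print(aligned < drifted)
```

In training, a term like this would presumably be added to the task loss with some weighting factor; that weighting is likewise an assumption here, not a detail taken from the summary.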

The authors evaluate the ProVP-Ref method (combining ProVP and contrastive feature re-formation) on 11 image benchmark datasets. They show that it outperforms previous prompt-based methods, achieving state-of-the-art results on 7 out of the 11 datasets, in both few-shot and base-to-novel settings.

Critical Analysis

The paper makes a strong case for the potential of visual prompts in V-L models, showing that they can outperform previous prompt-based methods, most of which rely on text prompts, when done correctly. The authors' innovations around the ProVP structure and contrastive feature re-formation appear to be effective at addressing the shortcomings of previous visual prompt methods.

However, the paper does not provide much insight into the limitations or failure cases of the ProVP-Ref method. It would be helpful to understand the types of tasks or datasets where it may still struggle, and what the authors believe are the remaining challenges in this area.

Additionally, the paper focuses solely on image-based benchmarks. It would be interesting to see how the ProVP-Ref method performs on other types of V-L tasks, such as text-to-image generation or multimodal reasoning. Expanding the evaluation to a broader range of V-L capabilities could further demonstrate the versatility and limitations of this approach.

Overall, this paper makes a valuable contribution by showing the potential of visual prompts and introducing effective techniques to leverage them. Continued research in this direction could lead to more robust and adaptable V-L models.

Conclusion

This paper presents a novel "Progressive Visual Prompt" (ProVP) approach for adapting Vision-Language (V-L) models to downstream tasks. The authors demonstrate that visual prompts can outperform previous prompt-based methods when combined with their ProVP structure and a contrastive feature re-formation technique.

The ProVP-Ref method achieved state-of-the-art results on 7 out of 11 image benchmark datasets, suggesting that visual prompts can be a powerful tool for V-L model adaptation. This work highlights the importance of continued research into visual prompts and their potential to improve the flexibility and generalization of these multimodal AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Progressive Visual Prompt Learning with Contrastive Feature Re-formation

Chen Xu, Yuhan Zhu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, Limin Wang

Prompt learning has been designed as an alternative to fine-tuning for adapting Vision-language (V-L) models to the downstream tasks. Previous works mainly focus on text prompt while visual prompt works are limited for V-L models. The existing visual prompt methods endure either mediocre performance or unstable training process, indicating the difficulty of visual prompt learning. In this paper, we propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers. More importantly, our ProVP could effectively propagate the image embeddings to deep layers and behave partially similar to an instance adaptive prompt method. To alleviate generalization deterioration, we further propose a new contrastive feature re-formation, which prevents the serious deviation of the prompted visual feature from the fixed CLIP visual feature distribution. Combining both, our method (ProVP-Ref) is evaluated on 11 image benchmark datasets and achieves 7/11 state-of-the-art results on both few-shot and base-to-novel settings. To the best of our knowledge, we are the first to demonstrate the superior performance of visual prompts in V-L models to previous prompt-based methods in downstream tasks. Meanwhile, it implies that our ProVP-Ref shows the best capability to adapt and to generalize.

7/2/2024

Revisiting Prompt Pretraining of Vision-Language Models

Zhenyuan Chen, Lingfeng Yang, Shuo Chen, Zhaowei Chen, Jiajun Liang, Xiang Li

Prompt learning is an effective method to customize Vision-Language Models (VLMs) for various downstream tasks, involving tuning very few parameters of input prompt tokens. Recently, prompt pretraining in large-scale dataset (e.g., ImageNet-21K) has played a crucial role in prompt learning for universal visual discrimination. However, we revisit and observe that the limited learnable prompts could face underfitting risks given the extensive images during prompt pretraining, simultaneously leading to poor generalization. To address the above issues, in this paper, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which targets at improving the fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction in common practice where query, key, and value vectors are derived from the shared learnable prompt token. Instead, we introduce unshared individual query, key, and value learnable prompts, thereby enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model. These soft labels yield more nuanced and general insights into the inter-class relationships, thereby endowing the pretraining process with better generalization ability. RPP produces a more resilient prompt initialization, enhancing its robust transferability across diverse visual recognition tasks. Experiments across various benchmarks consistently confirm the state-of-the-art (SOTA) performance of our pretrained prompts. Codes and models will be made available soon.

9/11/2024

Progressive Multi-modal Conditional Prompt Tuning

Xiaoyu Qiu, Hao Feng, Yuechen Wang, Wengang Zhou, Houqiang Li

Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting, which leverages VLMs as knowledge bases to extract information beneficial for downstream tasks. However, existing methods primarily employ uni-modal prompting, which only engages a uni-modal branch, failing to simultaneously adjust vision-language (V-L) features. Additionally, the one-pass forward pipeline in VLM encoding struggles to align V-L features that have a huge gap. Confronting these challenges, we propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT). ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing image and current encoding information. It comprises an initialization and a multi-modal iterative evolution (MIE) module. Initialization is responsible for encoding image and text using a VLM, followed by a feature filter that selects text features similar to image. MIE then facilitates multi-modal prompting through class-conditional vision prompting, instance-conditional text prompting, and feature filtering. In each MIE iteration, vision prompts are obtained from the filtered text features via a vision generator, promoting image features to focus more on target object during vision prompting. The encoded image features are fed into a text generator to produce text prompts that are more robust to class shift. Thus, V-L features are progressively aligned, enabling advance from coarse to exact classifications. Extensive experiments are conducted in three settings to evaluate the efficacy of ProMPT. The results indicate that ProMPT outperforms existing methods on average across all settings, demonstrating its superior generalization.

4/19/2024

Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

Jintao Rong, Hao Chen, Tianxiao Chen, Linlin Ou, Xinyi Yu, Yifan Liu

Prompt learning has become a popular approach for adapting large vision-language models, such as CLIP, to downstream tasks. Typically, prompt learning relies on a fixed prompt token or an input-conditional token to fit a small amount of data under full supervision. While this paradigm can generalize to a certain range of unseen classes, it may struggle when domain gap increases, such as in fine-grained classification and satellite image segmentation. To address this limitation, we propose Retrieval-enhanced Prompt learning (RePrompt), which introduces retrieval mechanisms to cache the knowledge representations from downstream tasks. We first construct a retrieval database from training examples, or from external examples when available. We then integrate this retrieval-enhanced mechanism into various stages of a simple prompt learning baseline. By referencing similar samples in the training set, the enhanced model is better able to adapt to new tasks with few samples. Our extensive experiments over 15 vision datasets, including 11 downstream tasks with few-shot setting and 4 domain generalization benchmarks, demonstrate that RePrompt achieves considerably improved performance. Our proposed approach provides a promising solution to the challenges faced by prompt learning when domain gap increases. The code and models will be available.

6/19/2024