PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Read original: arXiv:2309.07760 - Published 9/17/2024 by Thi Minh Anh Pham, An Duc Nguyen, Cephas Svosve, Vasileios Argyriou, Georgios Tzimiropoulos

Overview

Large pre-trained vision-language models like CLIP have great potential, but require manual prompt engineering to achieve optimal performance.
Recent work introduced prompt learning to avoid this challenge, but the learned prompts struggle to generalize to unseen classes.
This paper presents Prompt Learning with Reparameterization Encoder (PRE), a method to enhance the generalization of learnable prompts.

Plain English Explanation

Vision-language models like CLIP have shown impressive results on a variety of tasks. However, to get the best performance, you need to carefully choose the right "prompts" - short text descriptions that guide the model to the task at hand. This prompt engineering process requires domain expertise and is very time-consuming.
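To make manual prompt engineering concrete, here is a minimal sketch of how hand-written prompt templates are typically filled in with class names before being passed to a model like CLIP. The templates, class names, and the build_prompts helper are illustrative assumptions, not prompts or code from the paper.

```python
# Illustrative only: the templates and class names below are assumptions,
# not prompts taken from the paper.
TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
]


def build_prompts(class_names, template):
    """Fill one hand-written template with every class name."""
    return [template.format(name) for name in class_names]


classes = ["golden retriever", "tabby cat"]
for template in TEMPLATES:
    # Each template yields one candidate set of text prompts to evaluate.
    print(build_prompts(classes, template))
```

Choosing among such templates by hand, dataset by dataset, is the step that prompt learning tries to replace.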

To avoid this, some researchers developed "prompt learning" - a way to learn the prompts automatically from a small number of labeled examples. But prompts learned this way don't generalize well to unseen classes, that is, classes that were not part of the prompt training data.

This paper introduces a new method called Prompt Learning with Reparameterization Encoder (PRE) that enhances the generalization of the learnable prompts. Instead of directly optimizing the prompts, PRE uses an additional "prompt encoder" component to reparameterize the prompts in a way that helps the model better leverage task-specific knowledge from just a few training examples.

The authors show that PRE achieves notable improvements over previous prompt learning approaches, especially in recognizing unseen classes of images. It does this while maintaining strong performance on the base classes used to train the prompts.
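The distinction between base classes (used to train the prompts) and new classes (held out) can be sketched as follows. This is a rough illustration under stated assumptions: the equal split of classes, the helper names split_base_new and sample_k_shot, and the toy file names are all hypothetical and are not taken from the paper.

```python
import random


def split_base_new(class_names):
    """Split classes into base (seen in training) and new (held out).

    The equal split is an assumption made for illustration.
    """
    ordered = sorted(class_names)
    half = (len(ordered) + 1) // 2
    return ordered[:half], ordered[half:]


def sample_k_shot(examples_by_class, classes, k, seed=0):
    """Pick at most k training examples for each class in `classes`."""
    rng = random.Random(seed)
    subset = {}
    for name in classes:
        pool = list(examples_by_class[name])
        rng.shuffle(pool)
        subset[name] = pool[:k]
    return subset


# Toy dataset with hypothetical file names.
toy_data = {
    name: ["%s_%d.jpg" % (name, i) for i in range(20)]
    for name in ["cat", "dog", "fox", "owl"]
}
base_classes, new_classes = split_base_new(toy_data)
train_set = sample_k_shot(toy_data, base_classes, k=16)
print(base_classes, new_classes, {c: len(v) for c, v in train_set.items()})
```

In this sketch the prompts would be trained only on train_set, and accuracy would then be reported separately on the base classes and on the held-out new classes.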

Technical Explanation

The key innovation in this paper is the Prompt Learning with Reparameterization Encoder (PRE) method. Instead of directly optimizing the prompts as in prior work, PRE employs a prompt encoder module to reparameterize the input prompt embeddings.

This reparameterization lets the model better explore and leverage task-specific knowledge from the few-shot training samples available for each class. The authors hypothesize that this enhanced exploration leads to prompts that generalize better to unseen classes while maintaining strong performance on the base classes used to train the prompts.
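The general idea of reparameterizing prompt embeddings can be sketched as below. This is a toy illustration, not the authors' architecture: the per-token bottleneck network, the residual connection, the tanh activation, the class name ToyPromptEncoder, and all sizes are assumptions chosen to keep the example small. Only the forward pass is shown; in practice the prompt embeddings and encoder weights would be trained while the vision-language model stays frozen.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def init_matrix(rows, cols, rng, scale=0.02):
    """Create a rows x cols matrix of small random values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyPromptEncoder:
    """Hypothetical encoder: per-token bottleneck network plus a residual."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w_down = init_matrix(hidden, dim, rng)
        self.w_up = init_matrix(dim, hidden, rng)

    def __call__(self, prompt_embeddings):
        reparameterized = []
        for token in prompt_embeddings:
            hidden = [math.tanh(h) for h in matvec(self.w_down, token)]
            update = matvec(self.w_up, hidden)
            # Residual: keep the original embedding and add the encoder's update.
            reparameterized.append([t + u for t, u in zip(token, update)])
        return reparameterized


rng = random.Random(1)
num_tokens, dim = 4, 8  # toy sizes, chosen only for the demo
prompt = init_matrix(num_tokens, dim, rng)  # stands in for learnable prompt embeddings
encoder = ToyPromptEncoder(dim=dim, hidden=3)
new_prompt = encoder(prompt)  # same shape as the input prompt
print(len(new_prompt), len(new_prompt[0]))
```

The point of the sketch is the indirection: the text encoder would receive new_prompt rather than the raw prompt, so the learned context is shaped by the encoder's parameters instead of being optimized directly.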

Extensive experiments on 8 benchmarks show that PRE improves substantially on prior prompt learning approaches. Specifically, in the 16-shot setting, PRE gains 5.60% in average accuracy on new, unseen classes over CoOp (Context Optimization), along with a 3% improvement in the harmonic mean of base-class and new-class accuracy.
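For readers unfamiliar with the harmonic mean metric, the short sketch below shows how it combines base-class and new-class accuracy into one number. The accuracies used are made up for illustration and are not results from the paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Made-up accuracies for illustration, not results from the paper.
base_acc, new_acc = 80.0, 60.0
print(round(harmonic_mean(base_acc, new_acc), 2))  # 68.57
print((base_acc + new_acc) / 2)                    # 70.0 (arithmetic mean)
```

Because the harmonic mean sits closer to the lower of the two values than the arithmetic mean does, a method cannot score well on it by excelling on base classes while doing poorly on new ones.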

The authors attribute these gains to the reparameterization process, which allows the model to more effectively transfer knowledge from the few-shot training samples to novel classes. They also find that the training time for PRE is reasonable, making it an efficient and practical method for prompt learning.

Critical Analysis

The authors provide a thorough evaluation of PRE across multiple benchmarks, demonstrating its consistent advantages over previous prompt learning techniques. However, a few limitations and areas for future work are worth noting:

The paper focuses on improving generalization to unseen classes, but does not explore the model's performance on the original base classes in depth. Further analysis of the tradeoffs between base and new class accuracy would be insightful.
The experiments are limited to a few-shot setting (16 examples per class). It would be valuable to understand how PRE's performance scales with larger training set sizes.
The authors mention that PRE is an efficient method, but don't provide detailed measurements of its training time or computational cost relative to alternative approaches. More quantitative metrics in this regard would strengthen the claims.
While the results are promising, it's unclear how PRE would perform on more challenging, real-world computer vision tasks beyond the benchmarks presented. Validation on a broader range of applications would boost the external validity of the findings.

Overall, Prompt Learning with Reparameterization Encoder (PRE) is an interesting and potentially impactful contribution to vision-language models and few-shot learning. Further research exploring the method's limitations and applicability to diverse real-world scenarios would be a valuable next step.

Conclusion

This paper introduces Prompt Learning with Reparameterization Encoder (PRE), a novel method for enhancing the generalization of learnable prompts in large pre-trained vision-language models. By incorporating a prompt encoder to reparameterize the input prompts, PRE is able to better leverage task-specific knowledge from few-shot training samples, leading to significant improvements in recognizing new, unseen classes of images.

The authors demonstrate the effectiveness of PRE through extensive experiments on 8 benchmarks, reporting gains of 5.60% in average accuracy on new classes and 3% in harmonic mean over CoOp in the 16-shot setting. These results highlight the potential of PRE to make powerful vision-language models more practical to deploy in real-world applications where adapting to novel concepts is crucial.

Related Papers

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Thi Minh Anh Pham, An Duc Nguyen, Cephas Svosve, Vasileios Argyriou, Georgios Tzimiropoulos

Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.

9/17/2024

Revisiting Prompt Pretraining of Vision-Language Models

Zhenyuan Chen, Lingfeng Yang, Shuo Chen, Zhaowei Chen, Jiajun Liang, Xiang Li

Prompt learning is an effective method to customize Vision-Language Models (VLMs) for various downstream tasks, involving tuning very few parameters of input prompt tokens. Recently, prompt pretraining in large-scale dataset (e.g., ImageNet-21K) has played a crucial role in prompt learning for universal visual discrimination. However, we revisit and observe that the limited learnable prompts could face underfitting risks given the extensive images during prompt pretraining, simultaneously leading to poor generalization. To address the above issues, in this paper, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which targets at improving the fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction in common practice where query, key, and value vectors are derived from the shared learnable prompt token. Instead, we introduce unshared individual query, key, and value learnable prompts, thereby enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally utilize soft labels derived from zero-shot probability predictions provided by a pretrained Contrastive Language Image Pretraining (CLIP) teacher model. These soft labels yield more nuanced and general insights into the inter-class relationships, thereby endowing the pretraining process with better generalization ability. RPP produces a more resilient prompt initialization, enhancing its robust transferability across diverse visual recognition tasks. Experiments across various benchmarks consistently confirm the state-of-the-art (SOTA) performance of our pretrained prompts. Codes and models will be made available soon.

9/11/2024

Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

Jintao Rong, Hao Chen, Tianxiao Chen, Linlin Ou, Xinyi Yu, Yifan Liu

Prompt learning has become a popular approach for adapting large vision-language models, such as CLIP, to downstream tasks. Typically, prompt learning relies on a fixed prompt token or an input-conditional token to fit a small amount of data under full supervision. While this paradigm can generalize to a certain range of unseen classes, it may struggle when domain gap increases, such as in fine-grained classification and satellite image segmentation. To address this limitation, we propose Retrieval-enhanced Prompt learning (RePrompt), which introduces retrieval mechanisms to cache the knowledge representations from downstream tasks. We first construct a retrieval database from training examples, or from external examples when available. We then integrate this retrieval-enhanced mechanism into various stages of a simple prompt learning baseline. By referencing similar samples in the training set, the enhanced model is better able to adapt to new tasks with few samples. Our extensive experiments over 15 vision datasets, including 11 downstream tasks with few-shot setting and 4 domain generalization benchmarks, demonstrate that RePrompt achieves considerably improved performance. Our proposed approach provides a promising solution to the challenges faced by prompt learning when domain gap increases. The code and models will be available.

6/19/2024

Semantic Residual Prompts for Continual Learning

Martin Menabue, Emanuele Frascaroli, Matteo Boschini, Enver Sangineto, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and train a few parameter vectors termed prompts. Most of these methods organize these vectors in a pool of key-value pairs and use the input image as query to retrieve the prompts (values). However, as keys are learned while tasks progress, the prompting selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we leverage a foundation model (CLIP) to select our prompts within a two-level adaptation mechanism. Specifically, the first level leverages a standard textual prompt pool for the CLIP textual encoder, leading to stable class prototypes. The second level, instead, uses these prototypes along with the query image as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test. Notably, our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets. The codebase is available at https://github.com/aimagelab/mammoth.

7/19/2024