Multi-Prompt with Depth Partitioned Cross-Modal Learning

2305.06221

Published 5/1/2024 by Yingjie Tian, Yiqi Wang, Xianda Guo, Zheng Zhu, Long Chen

Abstract

In recent years, soft prompt learning methods have been proposed to fine-tune large-scale vision-language pre-trained models for various downstream tasks. These methods typically combine learnable textual tokens with class tokens as input for models with frozen parameters. However, they often employ a single prompt to describe class contexts, failing to capture categories' diverse attributes adequately. This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture the hierarchical contextual depths of visual representations. Furthermore, to maximize the advantages of multi-prompt learning, we incorporate prior information from manually designed templates and learnable multi-prompts, thus improving the generalization capabilities of our approach. We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a $79.28$ harmonic mean, averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), demonstrating significant competitiveness compared to state-of-the-art prompting methods.

Overview

Soft prompt learning methods have been used to fine-tune large-scale vision-language pre-trained models for various tasks.
These methods combine learnable textual tokens with class tokens as input, but often use a single prompt that fails to capture diverse category attributes.
This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that uses multiple prompts to better capture hierarchical visual representations.
The method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture contextual information at different levels.
Prior information from manually designed templates and learnable multi-prompts is incorporated to improve generalization capabilities.

Plain English Explanation

Large machine learning models trained on vast amounts of data have become incredibly powerful at tasks like image recognition and language understanding. Researchers have developed techniques to "fine-tune" these models for specific applications, rather than just using them as-is.

One popular approach is "soft prompt learning," where a small set of learnable tokens is combined with the class name and fed to the model as input, while the model's own parameters stay frozen. These learned tokens act as a hint that steers the model toward the specific task.

However, most soft prompt learning methods use a single prompt, which may not be able to capture all the nuanced information needed for complex tasks. The Partitioned Multi-modal Prompt (PMPO) technique introduced in this study tries to address this by using multiple prompts instead of just one.

The key idea is to split up the visual information processed by the model into different "depths" or levels of abstraction. Then, each prompt is associated with a different depth, allowing the model to learn more comprehensive representations of the task at hand. Additionally, the researchers incorporate prior knowledge from human-designed templates to further improve the model's performance.

This multi-prompt approach has shown promising results on several challenging computer vision tasks, like recognizing new classes of objects, performing well on different datasets, and generalizing to new domains. By leveraging the power of large pre-trained models in a more nuanced way, the PMPO method represents an advance in the field of prompt-based fine-tuning for vision-language models.

Technical Explanation

The Partitioned Multi-modal Prompt (PMPO) technique introduced in this study extends the soft prompt learning approach by using multiple learnable prompts instead of a single prompt.

The key innovation is to partition the depth of the pre-trained model's visual encoder into separate groups and connect a separate learnable prompt to each group. This allows the different prompts to capture hierarchical contextual information at different levels of the visual representation.
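To make the depth-partitioning idea concrete, here is a minimal sketch of one way to assign encoder layers to prompts. The summary does not say how PMPO chooses the partition, so the contiguous, near-equal split below, the function name `partition_depths`, and the 12-layer / 4-prompt setup are all illustrative assumptions rather than the authors' implementation.

```python
def partition_depths(num_layers: int, num_prompts: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal groups.

    Assumption: an even contiguous split; the paper may partition differently.
    """
    if not 1 <= num_prompts <= num_layers:
        raise ValueError("need 1 <= num_prompts <= num_layers")
    base, extra = divmod(num_layers, num_prompts)
    groups, start = [], 0
    for i in range(num_prompts):
        # The first `extra` groups absorb one leftover layer each.
        size = base + (1 if i < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


# Hypothetical setup: a 12-layer visual encoder shared among 4 prompts.
groups = partition_depths(12, 4)

# Lookup table: which prompt is attached to which encoder layer.
prompt_for_layer = {layer: p for p, g in enumerate(groups) for layer in g}
```

With this split, prompt 0 would be tied to the shallowest layers and prompt 3 to the deepest, which is one plausible reading of "different prompts capture different levels."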

Furthermore, to maximize the benefits of multi-prompt learning, the researchers incorporate prior information from manually designed templates. These templates provide additional guidance to the model, improving its generalization capabilities compared to using only learnable prompts.
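The summary does not specify how the template prior is merged with the learned prompts. As a rough sketch only, one simple option is to average the text feature of a hand-written template with the text features produced by each learnable prompt, then classify by cosine similarity to the image feature. The averaging rule, the function names, and the toy 2-dimensional feature values below are assumptions made for illustration, not details from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_class_feature(template_feat, prompt_feats):
    """Average the manual-template feature with the learnable-prompt features.

    Assumption: uniform averaging of unit-normalized features.
    """
    feats = [normalize(template_feat)] + [normalize(f) for f in prompt_feats]
    dim = len(template_feat)
    mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return normalize(mean)


def classify(image_feat, class_feats):
    """Return the index of the class with the highest cosine similarity."""
    img = normalize(image_feat)
    scores = [sum(a * b for a, b in zip(img, c)) for c in class_feats]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy, made-up features for two classes (one template + two prompts each).
class_feats = [
    ensemble_class_feature([1.0, 0.0], [[0.9, 0.1], [0.8, 0.2]]),
    ensemble_class_feature([0.0, 1.0], [[0.1, 0.9], [0.2, 0.8]]),
]
predicted = classify([0.7, 0.3], class_feats)
```

In a real system the features would come from the frozen text and image encoders; here they are hard-coded so the example runs on its own.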

The researchers evaluate their method on three tasks: new class generalization, cross-dataset evaluation, and domain generalization. For example, averaged over 11 diverse image recognition datasets, PMPO achieves a harmonic mean of 79.28, which is 7.62 points above the 71.66 of the CoOp baseline.
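For readers unfamiliar with the metric, the short sketch below shows how a harmonic mean of two accuracies is computed and checks the arithmetic behind the reported gap. In new class generalization benchmarks for prompt learning, the harmonic mean is usually taken over base-class and new-class accuracy; that convention is assumed here, and the 80/60 accuracies are made-up illustrative inputs, not results from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies; penalizes a large gap between them."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Illustrative inputs only (not from the paper): 80% on base, 60% on new classes.
example_h = round(harmonic_mean(80.0, 60.0), 2)  # 68.57, below the plain average of 70

# Figures reported in the abstract: PMPO's score and its gain over CoOp.
pmpo_h = 79.28
gain_over_coop = 7.62
coop_h = round(pmpo_h - gain_over_coop, 2)  # 71.66
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well by excelling on base classes while failing on new ones.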

Critical Analysis

The Partitioned Multi-modal Prompt (PMPO) approach represents an interesting advance in prompt-based fine-tuning for vision-language models. By using multiple prompts associated with different levels of the visual encoder, the method is able to capture more nuanced representations of the task at hand.

However, the paper does not deeply explore the limitations of this approach. For example, it's unclear how the number of prompts and their association with specific visual depths should be determined, and whether this process can be automated or requires manual tuning.

Additionally, the incorporation of human-designed templates raises questions about the scalability and generalization of the method – if the templates need to be manually crafted for each new task or domain, it may limit the practical applicability of the approach.

Further research could also investigate the interpretability and explainability of the multi-prompt representations, as understanding how the different prompts contribute to the model's decision-making process could lead to important insights.

Conclusion

The Partitioned Multi-modal Prompt (PMPO) technique represents an innovative approach to fine-tuning large-scale vision-language models using multiple learnable prompts. By associating prompts with different levels of the visual encoder, the method is able to capture more comprehensive representations of the task at hand, leading to improved performance on challenging computer vision benchmarks.

While the paper demonstrates the effectiveness of this multi-prompt learning strategy, further research is needed to address potential limitations, such as the scalability of the template-based approach and the interpretability of the learned prompt representations.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper for details.
