Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

Read original: arXiv:2409.10389 - Published 9/17/2024 by Hanbo Bi, Yingchao Feng, Wenhui Diao, Peijin Wang, Yongqiang Mao, Kun Fu, Hongqi Wang, Xian Sun

Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

Overview

This paper proposes a novel "Prompt-and-Transfer" method for few-shot segmentation.
The method dynamically enhances class-specific prompts to improve the few-shot segmentation performance.
It outperforms state-of-the-art few-shot segmentation methods on multiple benchmarks.

Plain English Explanation

The paper introduces a new technique called "Prompt-and-Transfer" for few-shot segmentation. In few-shot segmentation, the goal is to train a model to segment images into different objects or regions, but the model only has access to a small number of examples (the "few-shot") for each object class during training.

The key idea of the "Prompt-and-Transfer" method is to dynamically enhance the "prompts" - brief text descriptions or visual cues - that the model uses to identify each object class. By making these prompts more informative and tailored to each class, the model can better learn to recognize the objects in the few-shot examples and apply that knowledge to new images.

The method works by first training the model on a set of base classes with many training examples. It then learns how to automatically generate and refine the prompts for each class based on the few-shot examples provided during testing. This "prompt-and-transfer" approach allows the model to quickly adapt to new classes without requiring a full retraining.

The paper shows that this technique outperforms other state-of-the-art few-shot segmentation methods on several benchmarks, demonstrating its effectiveness at few-shot learning.

Technical Explanation

The key technical components of the "Prompt-and-Transfer" method are:

Base Training: The model is first trained on a set of base classes with abundant training data. This allows the model to learn general visual and semantic representations.
Prompt Encoder: A neural network is used to encode the class name or other textual/visual prompts into a feature representation. This prompt encoding is dynamically updated during few-shot adaptation.
Prompt-and-Transfer Adaptation: During few-shot testing, the model takes the prompt encodings and the few-shot examples as input. It learns to dynamically enhance the prompts to better match the visual characteristics of the few-shot classes.
Segmentation Head: The enhanced prompts are then used by the segmentation head to produce the final per-pixel predictions for the target image.

The key insight is that by dynamically updating the prompts based on the few-shot examples, the model can better leverage its base training to adapt to new classes. This outperforms approaches that use static or generic prompts.

Critical Analysis

The paper makes a strong contribution to the field of few-shot segmentation. The "Prompt-and-Transfer" technique is a clever way to combine the benefits of base training and few-shot adaptation.

One potential limitation is that the method still requires access to base classes with many training examples. An interesting direction for future work could be to explore few-shot strategies that can learn effective prompts from even fewer examples.

Additionally, the paper focuses on segmentation, but the prompt-based approach could potentially be extended to other few-shot computer vision tasks like classification or object detection. Exploring these extensions could further demonstrate the versatility of the method.

Overall, the "Prompt-and-Transfer" technique represents an important advance in few-shot segmentation, with promising avenues for future research and real-world applications.

Conclusion

This paper introduces a novel "Prompt-and-Transfer" method for few-shot semantic segmentation. By dynamically enhancing class-specific prompts, the model can better leverage its base training to adapt to new classes with only a few examples.

The technique outperforms other state-of-the-art few-shot segmentation approaches, demonstrating its effectiveness at few-shot learning. While the paper focuses on segmentation, the prompt-based approach could potentially be extended to other few-shot computer vision tasks, making it a valuable contribution to the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

New!Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

Hanbo Bi, Yingchao Feng, Wenhui Diao, Peijin Wang, Yongqiang Mao, Kun Fu, Hongqi Wang, Xian Sun

For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) would directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called ``Prompt and Transfer (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder for focusing on the interested object (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) that precisely transfers the class-specific semantics within the images to prompts. 3) Part Mask Generator (PMG) that works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-arts on 11 benchmarks.

9/17/2024

TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation

Jiaqi Yang, Ye Huang, Xiangjian He, Linlin Shen, Guoping Qiu

Under the backdrop of large-scale pre-training, large visual models (LVM) have demonstrated significant potential in image understanding. The recent emergence of the Segment Anything Model (SAM) has brought a qualitative shift in the field of image segmentation, supporting flexible interactive cues and strong learning capabilities. However, its performance often falls short in cross-domain and few-shot applications. Transferring prior knowledge from foundation models to new applications while preserving learning capabilities is worth exploring. This work proposes a task-adaptive prompt framework based on SAM, a new paradigm for Cross-dominan few-shot segmentation (CD-FSS). First, a Multi-level Feature Fusion (MFF) was used for integrated feature extraction. Besides, an additional Class Domain Task-Adaptive Auto-Prompt (CDTAP) module was combined with the segmentation branch for class-domain agnostic feature extraction and high-quality learnable prompt production. This significant advancement uses a unique generative approach to prompts alongside a comprehensive model structure and specialized prototype computation. While ensuring that the prior knowledge of SAM is not discarded, the new branch disentangles category and domain information through prototypes, guiding it in adapting the CD-FSS. We have achieved the best results on three benchmarks compared to the recent state-of-the-art (SOTA) methods. Comprehensive experiments showed that after task-specific and weighted guidance, the abundant feature information of SAM can be better learned for CD-FSS.

9/10/2024

APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentatio

Weizhao He, Yang Zhang, Wei Zhuo, Linlin Shen, Jiaqi Yang, Songhe Deng, Liang Sun

Few-shot semantic segmentation (FSS) endeavors to segment unseen classes with only a few labeled samples. Current FSS methods are commonly built on the assumption that their training and application scenarios share similar domains, and their performances degrade significantly while applied to a distinct domain. To this end, we propose to leverage the cutting-edge foundation model, the Segment Anything Model (SAM), for generalization enhancement. The SAM however performs unsatisfactorily on domains that are distinct from its training data, which primarily comprise natural scene images, and it does not support automatic segmentation of specific semantics due to its interactive prompting mechanism. In our work, we introduce APSeg, a novel auto-prompt network for cross-domain few-shot semantic segmentation (CD-FSS), which is designed to be auto-prompted for guiding cross-domain segmentation. Specifically, we propose a Dual Prototype Anchor Transformation (DPAT) module that fuses pseudo query prototypes extracted based on cycle-consistency with support prototypes, allowing features to be transformed into a more stable domain-agnostic space. Additionally, a Meta Prompt Generator (MPG) module is introduced to automatically generate prompt embeddings, eliminating the need for manual visual prompts. We build an efficient model which can be applied directly to target domains without fine-tuning. Extensive experiments on four cross-domain datasets show that our model outperforms the state-of-the-art CD-FSS method by 5.24% and 3.10% in average accuracy on 1-shot and 5-shot settings, respectively.

6/14/2024

Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach

Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little

The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.

4/19/2024