PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition

Read original: arXiv:2404.19168 - Published 5/1/2024 by Dongyun Lin, Yi Cheng, Shangbo Mao, Aiyuan Guo, Yiqun Li

🌐

Overview

The paper explores using large vision-language models like CLIP to address zero-shot and few-shot 3D shape recognition tasks.
The key challenge is generating a discriminative descriptor of 3D shapes from multiple view images, without explicit training (zero-shot) or with limited data (few-shot).
The paper proposes a Prompt-Enhanced View Aggregation Network (PEVA-Net) to tackle both zero-shot and few-shot 3D shape recognition simultaneously.

Plain English Explanation

The paper is about using powerful AI models that can understand both images and language, like CLIP, to recognize 3D shapes. This is a challenging task because the 3D shape information is spread across multiple 2D images.

The researchers found that the techniques that work well for zero-shot (no training) 3D shape recognition can also help improve few-shot (limited training) 3D shape recognition. Their PEVA-Net model uses language "prompts" to better aggregate the information from multiple 2D views of the 3D shape. This allows it to accurately recognize 3D shapes even without much training data.

By leveraging the zero-shot recognition capabilities to guide the few-shot learning, the model can learn very effective 3D shape descriptors from limited data. This is like an experienced person helping a beginner learn a new skill faster.

Technical Explanation

The paper proposes the Prompt-Enhanced View Aggregation Network (PEVA-Net) to address both zero-shot and few-shot 3D shape recognition tasks.

For zero-shot recognition, PEVA-Net uses language "prompts" derived from the candidate shape categories to enhance the aggregation of multiple 2D view features into a discriminative 3D shape descriptor. This allows effective zero-shot recognition without any explicit training.

For few-shot recognition, PEVA-Net first uses a transformer encoder to aggregate the 2D view features into a global 3D shape descriptor. To train this encoder, in addition to the main classification loss, the model is guided by a feature distillation loss that treats the zero-shot descriptor as a target. This "self-distillation" scheme significantly improves the few-shot learning performance.

The key insight is that the techniques that enable effective zero-shot inference can also be leveraged to enhance few-shot learning for 3D shape recognition. By combining these two complementary approaches, PEVA-Net can achieve strong performance in both zero-shot and few-shot scenarios.

Critical Analysis

The paper presents a compelling approach to leveraging large vision-language models for 3D shape recognition in low-data regimes. The proposed PEVA-Net model demonstrates the synergies between zero-shot and few-shot learning, which is an interesting direction for further exploration.

However, the paper does not provide a detailed analysis of the limitations or failure modes of the approach. For example, it would be useful to understand how the model's performance scales with the number of training examples or the complexity of the 3D shapes. Additionally, the paper does not compare PEVA-Net's performance to other zero-shot or few-shot 3D shape recognition methods, making it difficult to assess the relative merits of the proposed approach.

Further research could also explore the robustness of PEVA-Net to variations in the 3D shape representations, such as changes in viewpoint, occlusions, or noise. Investigating the unsupervised enhancement of the zero-shot descriptors could also lead to additional performance gains.

Overall, the paper presents a promising direction for leveraging large vision-language models to address the challenging problem of 3D shape recognition, and the proposed PEVA-Net model offers an interesting approach to combining zero-shot and few-shot learning techniques.

Conclusion

The paper explores the use of large vision-language models, such as CLIP, to tackle the problem of zero-shot and few-shot 3D shape recognition. The key insight is that the techniques enabling effective zero-shot inference can also be leveraged to enhance few-shot learning performance.

The proposed PEVA-Net model demonstrates how language "prompts" can be used to guide the aggregation of multiple 2D view features into a discriminative 3D shape descriptor, enabling accurate zero-shot recognition. Furthermore, the self-distillation scheme that uses the zero-shot descriptor as a target can significantly improve the few-shot learning efficacy.

By combining these complementary zero-shot and few-shot learning approaches, PEVA-Net offers a promising solution to the challenge of 3D shape recognition with limited training data. This research highlights the potential of large vision-language models to address complex visual tasks in low-resource scenarios, which could have important implications for practical applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🌐

PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition

Dongyun Lin, Yi Cheng, Shangbo Mao, Aiyuan Guo, Yiqun Li

Large vision-language models have impressively promote the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting the large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of the 3D shape represented by multiple view images under the scenarios of either without explicit training (zero-shot 3D shape recognition) or training with a limited number of data (few-shot 3D shape recognition). We analyze that both tasks are relevant and can be considered simultaneously. Specifically, leveraging the descriptor which is effective for zero-shot inference to guide the tuning of the aggregated descriptor under the few-shot training can significantly improve the few-shot learning efficacy. Hence, we propose Prompt-Enhanced View Aggregation Network (PEVA-Net) to simultaneously address zero/few-shot 3D shape recognition. Under the zero-shot scenario, we propose to leverage the prompts built up from candidate categories to enhance the aggregation process of multiple view-associated visual features. The resulting aggregated feature serves for effective zero-shot recognition of the 3D shapes. Under the few-shot scenario, we first exploit a transformer encoder to aggregate the view-associated visual features into a global descriptor. To tune the encoder, together with the main classification loss, we propose a self-distillation scheme via a feature distillation loss by treating the zero-shot descriptor as the guidance signal for the few-shot descriptor. This scheme can significantly enhance the few-shot learning efficacy.

5/1/2024

👁️

MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

Dan Song, Xinwei Fu, Ning Liu, Weizhi Nie, Wenhui Li, Lanjun Wang, You Yang, Anan Liu

Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Leveraging the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, the strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.

9/12/2024

Prompt-based Visual Alignment for Zero-shot Policy Transfer

Haihan Gao, Rui Zhang, Qi Yi, Hantao Yao, Haochen Li, Jiaming Guo, Shaohui Peng, Yunkai Gao, QiCheng Wang, Xing Hu, Yuanbo Wen, Zihao Zhang, Zidong Du, Ling Li, Qi Guo, Yunji Chen

Overfitting in RL has become one of the main obstacles to applications in reinforcement learning(RL). Existing methods do not provide explicit semantic constrain for the feature extractor, hindering the agent from learning a unified cross-domain representation and resulting in performance degradation on unseen domains. Besides, abundant data from multiple domains are needed. To address these issues, in this work, we propose prompt-based visual alignment (PVA), a robust framework to mitigate the detrimental domain bias in the image for zero-shot policy transfer. Inspired that Visual-Language Model (VLM) can serve as a bridge to connect both text space and image space, we leverage the semantic information contained in a text sequence as an explicit constraint to train a visual aligner. Thus, the visual aligner can map images from multiple domains to a unified domain and achieve good generalization performance. To better depict semantic information, prompt tuning is applied to learn a sequence of learnable tokens. With explicit constraints of semantic information, PVA can learn unified cross-domain representation under limited access to cross-domain data and achieves great zero-shot generalization ability in unseen domains. We verify PVA on a vision-based autonomous driving task with CARLA simulator. Experiments show that the agent generalizes well on unseen domains under limited access to multi-domain data.

6/6/2024

🛸

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng

Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP

6/7/2024