Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion

Read original: arXiv:2312.10692 - Published 9/4/2024 by Xiao Wang, Jiandong Jin, Chenglong Li, Jin Tang, Cheng Zhang, Wei Wang

Overview

Pedestrian attribute recognition is an important computer vision task that involves identifying various characteristics of people in images.
This paper proposes a novel approach that leverages CLIP, a pre-trained vision-language model, to perform pedestrian attribute recognition through prompt-based vision-language fusion.
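The key contributions are a vision-language formulation of the task, a multi-modal Transformer that fuses image and text features, and a region-aware prompt tuning scheme that updates only the prompt vectors and classification heads. The method is evaluated on the existing benchmarks RAPv1, RAPv2, WIDER, and PA100K, and on the zero-shot benchmarks PETA-ZS and RAP-ZS.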

Plain English Explanation

The paper introduces a new way to identify different attributes of people in images, such as their clothing, accessories, or physical characteristics. The researchers use a powerful artificial intelligence model called CLIP that has been trained on a vast amount of image and text data.

Instead of training a model from scratch, the researchers build on CLIP's pre-existing knowledge. Each attribute label is expanded into a short sentence that serves as a text prompt, and only a small set of prompt vectors and classification heads is trained while the rest of the model stays frozen. This lets the model adapt to pedestrian attribute recognition without extensive retraining.

The key innovation is how the researchers fuse the visual information from the images with the language information from the prompts. A multi-modal Transformer lets the image features and the attribute sentences interact, so each attribute prediction draws on both modalities.

The researchers evaluate the approach on several existing pedestrian attribute benchmarks, in both standard and zero-shot settings, and report new state-of-the-art results.

Overall, this work presents an efficient and high-performing solution for pedestrian attribute recognition, which has important applications in areas like surveillance, fashion, and human-computer interaction.

Technical Explanation

The paper introduces a CLIP-based approach for pedestrian attribute recognition. CLIP is a pre-trained vision-language model that can encode both visual and textual information into a shared latent space, enabling powerful cross-modal reasoning.
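A minimal sketch of what a shared latent space allows, assuming toy three-dimensional vectors in place of real CLIP encoder outputs. The vectors, the example sentences, and the helper names are all hypothetical; the point is only that once images and sentences live in the same space, cosine similarity can compare them directly.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Hypothetical stand-ins for encoder outputs; real CLIP embeddings
# have hundreds of dimensions and come from the pre-trained encoders.
image_embedding = [0.9, 0.1, 0.3]
text_embeddings = {
    "a pedestrian wearing a hat": [0.8, 0.2, 0.4],
    "a pedestrian carrying a backpack": [-0.5, 0.9, 0.1],
}

# Score every attribute sentence against the image in the shared space.
scores = {
    sentence: cosine_similarity(image_embedding, embedding)
    for sentence, embedding in text_embeddings.items()
}
best_match = max(scores, key=scores.get)
```

Here `best_match` is the sentence whose toy embedding points in nearly the same direction as the image embedding. The paper goes further than this similarity lookup by fusing the two feature sets before predicting attributes.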

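The core of the method is a prompt-based learning framework. Attribute phrases are first expanded into sentences and embedded with CLIP's text encoder. Rather than fine-tuning the whole network, the authors propose region-aware prompt tuning, which updates only the prompt vectors and classification heads while the pre-trained vision-language model and the multi-modal Transformer stay fixed. According to the abstract, this amounts to 0.75% of the parameters that full fine-tuning would adjust.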
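As a rough illustration of turning attribute labels into textual prompts, the sketch below expands short attribute phrases into full sentences. The templates, the attribute list, and the function name are assumptions made for illustration; the exact wording the authors use is not given here.

```python
# Hypothetical sentence templates, keyed by how the attribute relates
# to the person. These are illustrative, not the authors' templates.
TEMPLATES = {
    "wearing": "a pedestrian wearing {}",
    "carrying": "a pedestrian carrying {}",
    "with": "a pedestrian with {}",
}

# Hypothetical attribute phrases paired with a template key.
ATTRIBUTES = [
    ("a hat", "wearing"),
    ("a backpack", "carrying"),
    ("long hair", "with"),
]


def expand_attributes(attributes):
    """Turn (phrase, template key) pairs into full prompt sentences."""
    return [TEMPLATES[kind].format(phrase) for phrase, kind in attributes]


prompts = expand_attributes(ATTRIBUTES)
for prompt in prompts:
    print(prompt)
```

Each resulting sentence would then be passed through the text encoder, giving one text feature per attribute.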

To fuse the visual and language modalities, the researchers extract visual features from the image with CLIP's image encoder and language features from the attribute sentences with CLIP's text encoder. A multi-modal Transformer then fuses the two feature sets, and a feed-forward network produces the final attribute predictions.
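The fusion step can be sketched in plain Python as one single-head self-attention layer over the concatenated visual and text tokens, followed by a sigmoid head. This is a toy under stated assumptions, not the authors' architecture: the random weights, the four-dimensional tokens, reading one prediction from each fused text token, and sharing one linear head across attributes are all choices made here for illustration. A real implementation would use a deep learning framework and the actual CLIP features.

```python
import math
import random

random.seed(0)
DIM = 4  # toy feature size; real CLIP features are far larger


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply an n x d matrix by a d x m matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(DIM) for s in row]) for row in scores]
    return matmul(weights, v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_attributes(visual_tokens, text_tokens, params):
    # Concatenate both modalities so every token can attend to every other.
    tokens = visual_tokens + text_tokens
    fused = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    # Assumption: one prediction is read from each fused text token.
    fused_text = fused[len(visual_tokens):]
    return [
        sigmoid(sum(w * x for w, x in zip(params["w_head"], tok)) + params["b_head"])
        for tok in fused_text
    ]


params = {
    "w_q": random_matrix(DIM, DIM),
    "w_k": random_matrix(DIM, DIM),
    "w_v": random_matrix(DIM, DIM),
    "w_head": [random.uniform(-0.5, 0.5) for _ in range(DIM)],
    "b_head": 0.0,
}
visual_tokens = random_matrix(3, DIM)  # stand-in for image patch features
text_tokens = random_matrix(2, DIM)    # stand-in for two attribute sentences
probabilities = predict_attributes(visual_tokens, text_tokens, params)
```

`probabilities` holds one value in (0, 1) per attribute sentence. Because attributes are not mutually exclusive, a sigmoid per attribute is used here rather than a softmax across attributes.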

The researchers evaluate their approach on RAPv1, RAPv2, WIDER, and PA100K, and on the zero-shot benchmarks PETA-ZS and RAP-ZS. They report new state-of-the-art performance in both the standard and zero-shot settings.
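The text above does not name the metrics. One metric commonly reported in pedestrian attribute recognition work is label-based mean accuracy (mA), which averages the true positive rate and true negative rate for each attribute and then averages over attributes. The sketch below computes it on made-up labels; whether the paper reports exactly this metric is an assumption, and the data is purely illustrative.

```python
def mean_accuracy(y_true, y_pred):
    """Label-based mean accuracy (mA) over binary attribute labels.

    y_true and y_pred are lists of per-image label lists (0 or 1).
    """
    num_attrs = len(y_true[0])
    per_attr = []
    for j in range(num_attrs):
        pos = sum(1 for t in y_true if t[j] == 1)
        neg = len(y_true) - pos
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 0)
        # Guard against attributes with no positives or no negatives.
        tpr = tp / pos if pos else 0.0
        tnr = tn / neg if neg else 0.0
        per_attr.append((tpr + tnr) / 2)
    return sum(per_attr) / num_attrs


# Made-up labels: four images, two attributes.
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 0], [0, 0]]
ma = mean_accuracy(y_true, y_pred)
```

Averaging the positive and negative rates keeps rare attributes from being swamped by the majority class, which matters because many pedestrian attributes are present in only a small share of images.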

Critical Analysis

The paper presents a compelling approach to pedestrian attribute recognition that capitalizes on the strengths of pre-trained vision-language models like CLIP. The prompt-based learning framework and multi-modal fusion strategy are well-designed and appear to be effective in practice.

One potential limitation of the approach is its reliance on the quality and coverage of the pre-trained CLIP model. If CLIP's knowledge is biased or incomplete, this could introduce systematic errors into the pedestrian attribute predictions. Additionally, performance may be sensitive to how the attribute sentences and prompts are worded, which could be an important consideration in real-world deployments.

Another area for further research could be investigating the interpretability and explainability of the model's decision-making process. Understanding how the visual and language features are combined to arrive at the final predictions could provide valuable insights and build trust in the system.

Despite these potential areas for improvement, the overall approach represents a significant advancement in pedestrian attribute recognition and demonstrates the power of leveraging pre-trained vision-language models for downstream tasks.

Conclusion

This paper presents a novel CLIP-based method for pedestrian attribute recognition that utilizes prompt-based learning and multi-modal fusion. The key innovations include a prompt-driven adaptation strategy and an effective way to combine visual and language features for accurate attribute prediction.

The evaluation on existing benchmarks, in both standard and zero-shot settings, shows the approach outperforming prior state-of-the-art techniques. This work has implications for applications such as surveillance, fashion, and human-computer interaction, where reliable pedestrian attribute recognition is crucial.

Overall, this paper contributes a significant advancement in the field of pedestrian attribute recognition and demonstrates the power of leveraging pre-trained vision-language models for specialized computer vision tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion

Xiao Wang, Jiandong Jin, Chenglong Li, Jin Tang, Cheng Zhang, Wei Wang

Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including RAPv1, RAPv2, WIDER, PA100K, and PETA-ZS, RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.

9/4/2024

Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition

Xiao Wang, Qian Zhu, Jiandong Jin, Jun Zhu, Futian Wang, Bo Jiang, Yaowei Wang, Yonghong Tian

Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image, however, the performance is unreliable in challenging scenarios, such as heavy occlusion, motion blur, etc. In this work, we propose to understand human attributes using video frames that can fully use temporal information by fine-tuning a pre-trained multi-modal foundation model efficiently. Specifically, we formulate the video-based PAR as a vision-language fusion problem and adopt a pre-trained foundation model CLIP to extract the visual features. More importantly, we propose a novel spatiotemporal side-tuning strategy to achieve parameter-efficient optimization of the pre-trained vision foundation model. To better utilize the semantic information, we take the full attribute list that needs to be recognized as another input and transform the attribute words/phrases into the corresponding sentence via split, expand, and prompt operations. Then, the text encoder of CLIP is utilized for embedding processed attribute descriptions. The averaged visual tokens and text tokens are concatenated and fed into a fusion Transformer for multi-modal interactive learning. The enhanced tokens will be fed into a classification head for pedestrian attribute prediction. Extensive experiments on two large-scale video-based PAR datasets fully validated the effectiveness of our proposed framework. The source code of this paper is available at https://github.com/Event-AHU/OpenPAR.

4/30/2024

Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework

Jiandong Jin, Xiao Wang, Qian Zhu, Haiyang Wang, Chenglong Li

Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and the performance of these datasets has already approached saturation. In the past five years, no large-scale dataset has been opened to the public. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also conducted to further narrow the gap between the dataset and real-world challenging scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets have thoroughly validated the efficacy of our proposed framework. The dataset and source code accompanying this paper will be made publicly available at https://github.com/Event-AHU/OpenPAR.

8/20/2024

Multi-modal Attribute Prompting for Vision-Language Models

Xin Liu, Jiamin Wu, Wenfei Yang, Xu Zhou, Tianzhu Zhang

Pre-trained Vision-Language Models (VLMs), like CLIP, exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios. Existing prompting techniques primarily focus on global text and image representations, yet overlooking multi-modal attribute characteristics. This limitation hinders the model's ability to perceive fine-grained visual details and restricts its generalization ability to a broader range of unseen classes. To address this issue, we propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment. The proposed MAP enjoys several merits. First, we introduce learnable visual attribute prompts enhanced by textual attribute semantics to adaptively capture visual attributes for images from unknown categories, boosting fine-grained visual perception capabilities for CLIP. Second, the proposed attribute-level alignment complements the global alignment to enhance the robustness of cross-modal alignment for open-vocabulary objects. To our knowledge, this is the first work to establish cross-modal attribute-level alignment for CLIP-based few-shot adaptation. Extensive experimental results on 11 datasets demonstrate that our method performs favorably against state-of-the-art approaches.

7/12/2024