Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Read original: arXiv:2401.06825 - Published 7/30/2024 by Jiangming Shi, Xiangbo Yin, Yeyun Chen, Yachao Zhang, Zhizhong Zhang, Yuan Xie, Yanyun Qu

Overview

This paper presents a novel deep learning approach for unsupervised visible-infrared person re-identification (ReID).
The method, called Multi-Memory Matching (abbreviated MMM in the paper's abstract and M3 in this summary), leverages multiple memory banks to learn discriminative cross-modal features.
The model is trained in an end-to-end fashion without requiring any labeled data.

Plain English Explanation

Person re-identification (ReID) is the task of identifying the same person across different camera views. This is a challenging problem, especially when the camera views use different modalities, such as visible and infrared (IR) cameras.

The Multi-Memory Matching (M3) approach proposed in this paper addresses the unsupervised visible-infrared ReID task. The key idea is to use multiple memory banks to learn discriminative cross-modal features without any labeled training data.

The memory banks store and update feature representations of the visible and IR images. The model learns to match these feature representations across the two modalities, allowing it to identify the same person in visible and IR images, even though they may look quite different.

The advantage of this approach is that it can be trained end-to-end in an unsupervised manner, without requiring expensive and time-consuming manual labeling of training data. This makes it more practical for real-world applications where labeled data may be scarce or difficult to obtain.

Technical Explanation

The M3 model consists of two encoders, one for visible images and one for IR images, that share parameters. These encoders map the input images to feature representations stored in respective memory banks.
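To make the idea of a per-modality memory bank concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the `MemoryBank` class, the momentum update rule, the `momentum` value, and the nearest-slot matching in `match_across_modalities` are all assumptions chosen for illustration, and the encoder is left out entirely (features are taken as given).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


class MemoryBank:
    """Hypothetical memory for one modality: one unit-norm vector per slot."""

    def __init__(self, num_slots, dim, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.slots = l2_normalize(rng.normal(size=(num_slots, dim)))
        self.momentum = momentum  # assumed value, not taken from the paper

    def update(self, features, slot_ids):
        """Blend each encoded feature into its assigned slot (momentum update)."""
        features = l2_normalize(np.asarray(features, dtype=float))
        for feature, k in zip(features, slot_ids):
            mixed = self.momentum * self.slots[k] + (1.0 - self.momentum) * feature
            self.slots[k] = l2_normalize(mixed)


def match_across_modalities(visible_bank, infrared_bank):
    """For each visible slot, return the index of the most similar infrared slot."""
    similarity = visible_bank.slots @ infrared_bank.slots.T
    return similarity.argmax(axis=1)
```

In this sketch each modality keeps its own bank, and cross-modal matching is a simple argmax over cosine similarity. A one-to-one assignment (for example, bipartite matching) would be a stricter alternative.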

The model is trained using a multi-task learning approach, with three main objectives:

Memory Matching: The model learns to match the feature representations in the visible and IR memory banks, encouraging the encoders to learn common cross-modal features.
Memory Diversity: The model also learns to diversify the feature representations in each memory bank, ensuring that the stored features are distinctive and informative.
Memory Compactness: The model encourages the feature representations in each memory bank to be compact, making the stored features more discriminative.
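The summary does not give formulas for these three objectives, so the following is only one plausible reading, sketched in NumPy. Every function name, the temperature value, the loss weights, and the specific forms (a softmax contrastive term for matching, a penalty on off-diagonal slot similarity for diversity, a squared distance to the assigned slot for compactness) are assumptions, not the paper's definitions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def matching_loss(features, other_memory, targets, temperature=0.05):
    """Contrastive term: pull each feature toward its matched slot in the
    other modality's memory and away from the remaining slots."""
    logits = features @ other_memory.T / temperature
    picked = log_softmax(logits)[np.arange(len(targets)), targets]
    return float(-picked.mean())


def diversity_loss(memory):
    """Penalize similarity between different slots of the same memory."""
    similarity = memory @ memory.T
    off_diagonal = similarity[~np.eye(len(memory), dtype=bool)]
    return float(np.mean(off_diagonal ** 2))


def compactness_loss(features, memory, targets):
    """Mean squared distance between each feature and its assigned slot."""
    return float(np.mean(np.sum((features - memory[targets]) ** 2, axis=1)))


def total_loss(features, own_memory, other_memory, targets,
               w_diversity=1.0, w_compactness=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (matching_loss(features, other_memory, targets)
            + w_diversity * diversity_loss(own_memory)
            + w_compactness * compactness_loss(features, own_memory, targets))
```

With an orthonormal memory and features that coincide with their assigned slots, the diversity and compactness terms vanish and the matching term is close to zero, which is the configuration these terms reward.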

The unsupervised training process involves iteratively updating the memory banks and the encoder networks to optimize these three objectives. This allows the model to learn powerful cross-modal features without any labeled training data.
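The alternation described above can be sketched as a simple loop. This is a simplified illustration under stated assumptions: pseudo-labels are assigned by nearest memory slot, the memory is refreshed with a momentum average of its members, and the encoder update is only marked with a comment because the summary does not specify it. The function names and the `momentum` and `num_rounds` values are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def assign_pseudo_labels(features, memory):
    """Label each feature with the index of its most similar memory slot."""
    return (features @ memory.T).argmax(axis=1)


def refresh_memory(features, labels, memory, momentum=0.5):
    """Move each slot toward the mean of the features currently assigned to it."""
    new_memory = memory.copy()
    for k in range(len(memory)):
        members = features[labels == k]
        if len(members) == 0:
            continue  # leave slots without members unchanged
        mixed = momentum * memory[k] + (1.0 - momentum) * members.mean(axis=0)
        new_memory[k] = l2_normalize(mixed)
    return new_memory


def train_sketch(features, memory, num_rounds=5):
    """Alternate between pseudo-labelling and memory refresh."""
    features = l2_normalize(features)
    memory = l2_normalize(memory)
    labels = assign_pseudo_labels(features, memory)
    for _ in range(num_rounds):
        labels = assign_pseudo_labels(features, memory)
        memory = refresh_memory(features, labels, memory)
        # A real system would take an encoder gradient step here and
        # re-extract the features before the next round.
    return labels, memory
```

Because the features are fixed in this sketch, the loop behaves like a few rounds of spherical k-means. In an actual training run the features would change after every encoder update, which is what makes the alternation necessary.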

Critical Analysis

The M3 model demonstrates strong performance on the public SYSU-MM01 and RegDB benchmarks for unsupervised visible-infrared ReID, outperforming previous state-of-the-art methods. However, the authors acknowledge that the model may still struggle in challenging real-world scenarios, such as those with large variations in illumination or viewpoint, or with heavy occlusion.

Additionally, the paper does not provide a detailed analysis of the memory bank dynamics and their influence on the learned features. It would be interesting to see a more in-depth investigation of how the memory banks evolve during training and how they contribute to the model's performance.

Overall, the M3 approach is a promising step towards more practical and scalable unsupervised visible-infrared ReID solutions, but further research is needed to address its limitations and understand its inner workings in greater detail.

Conclusion

The Multi-Memory Matching (M3) model presents an effective unsupervised approach for visible-infrared person re-identification. By leveraging multiple memory banks to learn discriminative cross-modal features, the model can achieve state-of-the-art performance without requiring any labeled training data.

This is a significant advancement in the field of ReID, as it can enable the deployment of these systems in real-world scenarios where labeled data is scarce or difficult to obtain. The M3 approach paves the way for more scalable and practical solutions for cross-modal person identification, with potential applications in security, surveillance, and other areas where robust person tracking is crucial.

Related Papers

Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Jiangming Shi, Xiangbo Yin, Yeyun Chen, Yachao Zhang, Zhizhong Zhang, Yuan Xie, Yanyun Qu

Unsupervised visible-infrared person re-identification (USL-VI-ReID) is a promising yet challenging retrieval task. The key challenges in USL-VI-ReID are to effectively generate pseudo-labels and establish pseudo-label correspondences across modalities without relying on any prior annotations. Recently, clustered pseudo-label methods have gained more attention in USL-VI-ReID. However, previous methods fell short of fully exploiting the individual nuances, as they simply utilized a single memory that represented an identity to establish cross-modality correspondences, resulting in ambiguous cross-modality correspondences. To address the problem, we propose a Multi-Memory Matching (MMM) framework for USL-VI-ReID. We first design a Cross-Modality Clustering (CMC) module to generate the pseudo-labels through clustering together both two modality samples. To associate cross-modality clustered pseudo-labels, we design a Multi-Memory Learning and Matching (MMLM) module, ensuring that optimization explicitly focuses on the nuances of individual perspectives and establishes reliable cross-modality correspondences. Finally, we design a Soft Cluster-level Alignment (SCA) module to narrow the modality gap while mitigating the effect of noise pseudo-labels through a soft many-to-many alignment strategy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences and the effectiveness of our MMM. The source codes will be released.

7/30/2024

Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID

De Cheng, Lingfeng He, Nannan Wang, Shizhou Zhang, Zhen Wang, Xinbo Gao

Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to match pedestrian images of the same identity from different modalities without annotations. Existing works mainly focus on alleviating the modality gap by aligning instance-level features of the unlabeled samples. However, the relationships between cross-modality clusters are not well explored. To this end, we propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters. Specifically, we design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM) algorithm through optimizing the maximum matching problem in a bipartite graph. Then, the matched pairwise clusters utilize shared visible and infrared pseudo-labels during the model training. Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at a cluster-level. Meanwhile, the cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the large modality discrepancy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art approaches by a large margin of 8.76% mAP on average.

5/28/2024

Learning Commonality, Divergence and Variety for Unsupervised Visible-Infrared Person Re-identification

Jiangming Shi, Xiangbo Yin, Yaoxing Wang, Xiaofeng Liu, Yuan Xie, Yanyun Qu

Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified people in infrared images to visible images without annotation, and vice versa. USVI-ReID is a challenging yet under-explored task. Most existing methods address the USVI-ReID problem using cluster-based contrastive learning, which simply employs the cluster center as a representation of a person. However, the cluster center primarily focuses on shared information, overlooking disparity. To address the problem, we propose a Progressive Contrastive Learning with Multi-Prototype (PCLMP) method for USVI-ReID. In brief, we first generate the hard prototype by selecting the sample with the maximum distance from the cluster center. This hard prototype is used in the contrastive loss to emphasize disparity. Additionally, instead of rigidly aligning query images to a specific prototype, we generate the dynamic prototype by randomly picking samples within a cluster. This dynamic prototype is used to retain the natural variety of features while reducing instability in the simultaneous learning of both common and disparate information. Finally, we introduce a progressive learning strategy to gradually shift the model's attention towards hard samples, avoiding cluster deterioration. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets validate the effectiveness of the proposed method. PCLMP outperforms the existing state-of-the-art method with an average mAP improvement of 3.9%. The source codes will be released.

5/28/2024

Unsupervised Visible-Infrared ReID via Pseudo-label Correction and Modality-level Alignment

Yexin Liu, Weiming Zhang, Athanasios V. Vasilakos, Lin Wang

Unsupervised visible-infrared person re-identification (UVI-ReID) has recently gained great attention due to its potential for enhancing human detection in diverse environments without labeling. Previous methods utilize intra-modality clustering and cross-modality feature matching to achieve UVI-ReID. However, there exist two challenges: 1) noisy pseudo labels might be generated in the clustering process, and 2) the cross-modality feature alignment via matching the marginal distribution of visible and infrared modalities may misalign the different identities from two modalities. In this paper, we first conduct a theoretic analysis where an interpretable generalization upper bound is introduced. Based on the analysis, we then propose a novel unsupervised cross-modality person re-identification framework (PRAISE). Specifically, to address the first challenge, we propose a pseudo-label correction strategy that utilizes a Beta Mixture Model to predict the probability of mis-clustering based network's memory effect and rectifies the correspondence by adding a perceptual term to contrastive learning. Next, we introduce a modality-level alignment strategy that generates paired visible-infrared latent features and reduces the modality gap by aligning the labeling function of visible and infrared features to learn identity discriminative and modality-invariant features. Experimental results on two benchmark datasets demonstrate that our method achieves state-of-the-art performance than the unsupervised visible-ReID methods.

4/11/2024