PC$^2$: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval

Read original: arXiv:2408.01349 - Published 8/6/2024 by Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi

Overview

The paper introduces a novel approach called "PC2" (Pseudo-Classification based Pseudo-Captioning) for noisy correspondence learning in cross-modal retrieval tasks.
PC2 leverages pseudo-classification and pseudo-captioning to learn robust cross-modal representations from noisy image-text pairs.
The authors contribute a realistic noisy dataset, Noise of Web (NoW), and demonstrate the effectiveness of PC2 on both simulated and realistic cross-modal retrieval benchmarks.

Plain English Explanation

The paper describes a new technique called "PC2" that helps computer systems learn the relationship between images and text even when the connections between them are not always clear or accurate (noisy).

The key idea is to use "pseudo-classification" and "pseudo-captioning": captions are treated as category labels, and image-text pairs that appear mismatched are given substitute captions in place of the original, unreliable ones. This helps the system learn robust representations for tasks like retrieving relevant images from a text query or relevant text from an image query.

The authors also introduce a new, more realistic dataset, Noise of Web (NoW), in which the noise occurs naturally, and they report that PC2 outperforms existing methods on both simulated and realistic data.

Technical Explanation

The paper proposes a novel framework called PC2 (Pseudo-Classification based Pseudo-Captioning) for noisy correspondence learning in cross-modal retrieval tasks. The key components of PC2 are:

Pseudo-Classification: An auxiliary task treats captions as categorical (pseudo-class) labels, so the model learns image-text semantic similarity through a non-contrastive objective even when the given pairings are noisy.
Pseudo-Captioning: For each pair judged to be mismatched, the model uses its pseudo-classification capability to supply a pseudo-caption, which gives more informative supervision than the margin-based adjustments used by prior methods.
Correspondence Correction: PC2 trains with both signals and additionally uses the oscillation of pseudo-classification predictions to help correct noisy correspondences.
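
The components above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy caption pool, and the rule of picking the first caption whose pseudo-class matches the image's predicted class are all assumptions made for illustration. The paper's abstract also mentions using the oscillation of pseudo-classification to help correct correspondences; the last function counts prediction flips as one hypothetical way to measure that.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_classification_loss(image_logits, caption_class):
    """Cross-entropy between the image's predicted class distribution
    and the pseudo-class assigned to its caption (non-contrastive)."""
    probs = softmax(image_logits)
    return -math.log(probs[caption_class])

def pick_pseudo_caption(image_logits, caption_pool):
    """Hypothetical selection rule: return the first caption in the pool
    whose pseudo-class equals the image's most likely class, else None.
    caption_pool is a list of (caption_text, pseudo_class) tuples."""
    probs = softmax(image_logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    for text, pseudo_class in caption_pool:
        if pseudo_class == predicted:
            return text
    return None

def count_prediction_flips(class_history):
    """Hypothetical oscillation measure: how often the predicted
    pseudo-class changed between consecutive epochs."""
    return sum(1 for a, b in zip(class_history, class_history[1:]) if a != b)

# Toy data (assumed): three pseudo-classes, one caption each.
pool = [("a dog runs on grass", 0), ("a plate of pasta", 1), ("a red sports car", 2)]
logits = [0.2, 3.0, -1.0]  # the image classifier favors pseudo-class 1

print(pick_pseudo_caption(logits, pool))
print(pseudo_classification_loss(logits, 1))
print(count_prediction_flips([1, 1, 2, 1]))
```

In this toy setup, an image paired with a caption from a different pseudo-class incurs a larger loss than one paired with a caption from its predicted class, which is the kind of signal a method could use to flag a pair as mismatched and substitute a pseudo-caption.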

The authors also introduce a new, more realistic dataset for evaluating cross-modal retrieval systems, which captures the challenging nature of real-world noisy correspondence learning scenarios. Experiments on various benchmarks demonstrate the effectiveness of PC2 compared to state-of-the-art methods.
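
The paper's abstract refers to both simulated and realistic noise settings. A common way to simulate noisy correspondence in this line of work is to reassign the captions of a chosen fraction of pairs; the summary does not confirm that PC2's experiments use exactly this procedure, so the sketch below is an assumption. The function name, the `noise_ratio` parameter, and the rotation-based reassignment are hypothetical.

```python
import random

def inject_noisy_correspondence(captions, noise_ratio, seed=0):
    """Return (noisy_captions, is_mismatched) where a fraction of
    captions has been reassigned to the wrong image index.

    captions[i] is assumed to be the correct caption for image i.
    The selected indices are rotated by one position, so every selected
    image receives the caption of a different image.
    """
    rng = random.Random(seed)
    n = len(captions)
    num_noisy = int(noise_ratio * n)
    source = list(range(n))  # source[i] = index of the caption given to image i
    if num_noisy >= 2:  # a single index cannot be swapped with anything
        chosen = rng.sample(range(n), num_noisy)
        rotated = chosen[1:] + chosen[:1]
        for dst, src in zip(chosen, rotated):
            source[dst] = src
    noisy_captions = [captions[src] for src in source]
    is_mismatched = [source[i] != i for i in range(n)]
    return noisy_captions, is_mismatched

captions = [f"caption {i}" for i in range(10)]
noisy, flags = inject_noisy_correspondence(captions, noise_ratio=0.4, seed=1)
print(sum(flags), "of", len(captions), "pairs are now mismatched")
```

The returned flags are only available in the simulated setting. On a naturally noisy dataset such as NoW, no such ground truth exists, which is why a method has to estimate which pairs are mismatched.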

Critical Analysis

The paper makes a valuable contribution by addressing the important challenge of learning from noisy image-text correspondences, which is common in real-world applications. The proposed PC2 approach is a clever and principled solution that leverages pseudo-classification and pseudo-captioning to extract useful signals from the noisy data.

One potential limitation is that the performance of PC2 may still be sensitive to the quality of the noisy dataset. The authors mention that their new dataset is more realistic, but it would be interesting to see how the method performs on even more challenging or diverse noisy data.

Additionally, the paper does not provide much analysis on the failure cases or limitations of PC2. It would be helpful to understand the scenarios where the approach may struggle and areas for potential future improvements.

Conclusion

This paper introduces a novel framework called PC2 that addresses the challenge of noisy correspondence learning in cross-modal retrieval tasks. By combining pseudo-classification and pseudo-captioning, PC2 is able to learn robust cross-modal representations even when the original image-text pairings are not perfect.

The authors also contribute a more realistic dataset for evaluating such techniques, which is an important step towards developing more practical and deployable cross-modal systems. Overall, the PC2 approach represents a significant advancement in the field of cross-modal learning and has the potential to enable a wide range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PC$^2$: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval

Yue Duan, Zhangxuan Gu, Zhenzhe Ying, Lei Qi, Changhua Meng, Yinghuan Shi

In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, which is a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy: firstly, it establishes an auxiliary pseudo-classification task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Secondly, unlike prevailing margin-based techniques, capitalizing on PC$^2$'s pseudo-classification capability, we generate pseudo-captions to provide more informative and tangible supervision for each mismatched pair. Thirdly, the oscillation of pseudo-classification is borrowed to assist the correction of correspondence. In addition to technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which could be a new powerful NCL benchmark where noise exists naturally. Empirical evaluations of PC$^2$ showcase marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets with various NCL settings. The contributed dataset and source code are released at https://github.com/alipay/PC2-NoiseofWeb.

8/6/2024

Disentangled Noisy Correspondence Learning

Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang

Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in the modality-invariant subspace, thereby greatly boosting the similarity-based alleviation strategy for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model noisy many-to-many relationships inherent in multi-modal input for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy with a 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.

8/13/2024

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Paul Primus, Florian Schmid, Gerhard Widmer

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.

8/22/2024

CromSS: Cross-modal pre-training with noisy labels for remote sensing image segmentation

Chenying Liu, Conrad Albrecht, Yi Wang, Xiao Xiang Zhu

We study the potential of noisy labels $y$ to pretrain semantic segmentation models in a multi-modal learning framework for geospatial applications. Specifically, we propose a novel Cross-modal Sample Selection method (CromSS) that utilizes the class distributions $P^{(d)}(x,c)$ over pixels $x$ and classes $c$ modelled by multiple sensors/modalities $d$ of a given geospatial scene. Consistency of predictions across sensors $d$ is jointly informed by the entropy of $P^{(d)}(x,c)$. Noisy label sampling is determined by the confidence of each sensor $d$ in the noisy class label, $P^{(d)}(x,c=y(x))$. To verify the performance of our approach, we conduct experiments with Sentinel-1 (radar) and Sentinel-2 (optical) satellite imagery from the globally-sampled SSL4EO-S12 dataset. We pair those scenes with 9-class noisy labels sourced from the Google Dynamic World project for pretraining. Transfer learning evaluations (downstream task) on the DFC2020 dataset confirm the effectiveness of the proposed method for remote sensing image segmentation.

5/3/2024