Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations

Read original: arXiv:2306.02092 - Published 9/4/2024 by Xu Zhang, Zhedong Zheng, Linchao Zhu, Yi Yang

Overview

Composed image retrieval allows users to search for images using reference images and captions that describe their intent.
Current methods face an issue called triplet ambiguity, where multiple visually dissimilar images can be matched to the same reference pair (image + caption).
To address this, the paper proposes the Consensus Network (Css-Net), which uses a consensus module with diverse compositors and a Kullback-Leibler divergence loss to promote consensual outputs.
Css-Net improves recall on benchmark datasets, notably FashionIQ, where the reported gains are 2.77% in R@10 and 6.67% in R@50.

Plain English Explanation

Composed image retrieval lets users search for images with a reference image plus a caption that describes what they are looking for. This extends traditional content-based image retrieval, which uses only a query image.

The key challenge the paper addresses is "triplet ambiguity": multiple visually different images can all be valid matches for the same reference pair (image + caption). It arises because the annotated captions capture the intended change only partially, which leaves many "noisy" triplets (reference image, caption, target image) in the data.
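To make the ambiguity concrete, here is a minimal toy sketch, not taken from the paper. It assumes a hypothetical gallery where each image has a few attributes and a caption that constrains only some of them; the image ids, attribute names, and the `plausible_targets` helper are all invented for illustration.

```python
# Hypothetical gallery: image id -> attributes a caption might refer to.
gallery = {
    "dress_117": {"color": "red", "length": "short", "pattern": "plain"},
    "dress_342": {"color": "red", "length": "short", "pattern": "floral"},
    "dress_208": {"color": "blue", "length": "short", "pattern": "plain"},
}

# One annotated triplet: the caption ("is red and shorter") only constrains
# color and length, and exactly one image is labeled as the target.
caption_constraints = {"color": "red", "length": "short"}
annotated_target = "dress_117"


def plausible_targets(gallery, constraints):
    """Return every image id whose attributes satisfy all caption constraints."""
    return sorted(
        image_id
        for image_id, attrs in gallery.items()
        if all(attrs.get(key) == value for key, value in constraints.items())
    )


matches = plausible_targets(gallery, caption_constraints)
# Images that satisfy the caption but are not the labeled target would be
# treated as negatives during training, which is the source of the noise.
unlabeled_matches = [m for m in matches if m != annotated_target]
print(matches, unlabeled_matches)
```

In this toy case two visually different dresses (plain and floral) satisfy the same reference pair, yet only one is labeled as the target.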

To solve this, the paper proposes the Consensus Network (Css-Net). This system has two main components:

(1) A consensus module with four diverse "compositors", each generating its own image-text embeddings. This captures different perspectives and reduces reliance on any single, potentially biased compositor.
(2) A Kullback-Leibler divergence loss that encourages the compositors to interact during training and produce more consensual outputs.

During evaluation, the decisions of the four compositors are combined using a weighting scheme to enhance overall agreement.

Css-Net improves results on benchmark datasets, particularly FashionIQ. The gains show up in recall, which measures how often the correct image appears among the top retrieved results, suggesting that the consensus approach handles noisy triplets better than existing methods.

Technical Explanation

The paper proposes the Consensus Network (Css-Net) to address the issue of triplet ambiguity in composed image retrieval systems.

Triplet ambiguity refers to the semantic ambiguity that can arise between the reference image, the relative caption, and the target image. This is primarily due to the limited representation of the annotated text, resulting in many noisy triplets where multiple visually dissimilar candidate images can be matched to an identical reference pair.

To mitigate this challenge, Css-Net comprises two core components:

(1) Consensus Module: This module includes four diverse compositors, each generating distinct image-text embeddings. This fosters complementary feature extraction and reduces dependence on any single, potentially biased compositor.
(2) Kullback-Leibler Divergence Loss: This loss function encourages the compositors to learn how to interact and promote consensual outputs. By minimizing the Kullback-Leibler divergence between the compositors' outputs, the system learns to produce more aligned representations.
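The two components above can be sketched as a consensus term computed over the compositors' outputs. This is a minimal sketch under stated assumptions, not the paper's formulation: it assumes each compositor produces similarity logits over the same candidate images, turns them into distributions with a softmax, and averages KL(p_i || p_j) over all ordered pairs. The pairing, the direction of the divergence, and the `temperature` parameter are assumptions, since the summary does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert similarity logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consensus_kl_loss(logits_per_compositor, temperature=1.0):
    """Mean KL(p_i || p_j) over all ordered pairs of distinct compositors.

    Assumption: pairwise averaging is one plausible way to encourage
    agreement; the paper's exact loss may differ.
    """
    probs = [softmax(logits, temperature) for logits in logits_per_compositor]
    n = len(probs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += kl_divergence(probs[i], probs[j])
    return total / (n * (n - 1))


# Four hypothetical compositors scoring the same three candidate images.
agreeing = [[2.0, 0.5, 0.1]] * 4
disagreeing = [[2.0, 0.5, 0.1], [0.1, 2.0, 0.5], [0.5, 0.1, 2.0], [2.0, 0.5, 0.1]]
loss_agree = consensus_kl_loss(agreeing)
loss_disagree = consensus_kl_loss(disagreeing)
print(loss_agree, loss_disagree)
```

The term is zero when all compositors produce the same distribution and grows as they disagree, so minimizing it alongside the retrieval loss pushes the compositors toward consensus.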

During evaluation, the decisions of the four compositors are combined through a weighting scheme, enhancing the overall agreement and robustness of the system.
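One simple way to combine the compositors' decisions is a weighted sum of their per-candidate scores. The sketch below assumes this form; the function names, the example scores, and the weights are hypothetical placeholders, since the summary does not give the actual weighting scheme or its values.

```python
def fuse_scores(scores_per_compositor, weights):
    """Weighted sum of per-candidate similarity scores across compositors."""
    if len(scores_per_compositor) != len(weights):
        raise ValueError("need exactly one weight per compositor")
    n_candidates = len(scores_per_compositor[0])
    fused = [0.0] * n_candidates
    for scores, weight in zip(scores_per_compositor, weights):
        for idx, score in enumerate(scores):
            fused[idx] += weight * score
    return fused


def rank_candidates(fused_scores):
    """Return candidate indices sorted from highest to lowest fused score."""
    return sorted(range(len(fused_scores)), key=lambda i: fused_scores[i], reverse=True)


# Four hypothetical compositors scoring three candidate images.
scores = [
    [0.9, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.2],
    [0.8, 0.3, 0.1],
]

uniform_ranking = rank_candidates(fuse_scores(scores, [0.25, 0.25, 0.25, 0.25]))
skewed_ranking = rank_candidates(fuse_scores(scores, [0.0, 0.5, 0.5, 0.0]))
print(uniform_ranking, skewed_ranking)
```

With uniform weights candidate 0 ranks first, while weights that favor the second and third compositors promote candidate 1, which shows how the choice of weights can change the final ranking.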

Experiments on benchmark datasets, particularly FashionIQ, show that Css-Net improves recall. Specifically, the paper reports a 2.77% increase in R@10 and a 6.67% boost in R@50 (recall within the top 10 and top 50 retrieved images, respectively), indicating that the consensus design helps with the noisy-triplet limitation of existing methods.
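For readers unfamiliar with the metric, R@K is the percentage of queries whose annotated target appears among the top K retrieved images. The sketch below computes it on hypothetical toy rankings; the data and the `recall_at_k` name are illustrative, and the sketch does not reproduce the paper's evaluation protocol.

```python
def recall_at_k(ranked_ids_per_query, target_id_per_query, k):
    """Percentage of queries whose target is within the top-k ranked results."""
    if len(ranked_ids_per_query) != len(target_id_per_query):
        raise ValueError("need exactly one target per query")
    hits = sum(
        1
        for ranked, target in zip(ranked_ids_per_query, target_id_per_query)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_id_per_query)


# Four hypothetical queries, each with a ranked list of gallery image ids.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
targets = ["a", "a", "a", "b"]

r_at_1 = recall_at_k(rankings, targets, k=1)
r_at_2 = recall_at_k(rankings, targets, k=2)
r_at_3 = recall_at_k(rankings, targets, k=3)
print(r_at_1, r_at_2, r_at_3)
```

Because each query counts as a hit or a miss, an unlabeled but plausible image ranked above the annotated target lowers R@K even when the retrieval is reasonable, which is one reason noisy triplets matter for this metric.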

Critical Analysis

The paper does a commendable job of identifying and addressing the issue of triplet ambiguity, which is a significant challenge in composed image retrieval systems. The proposed Consensus Network (Css-Net) offers a novel solution by leveraging a consensus module with diverse compositors and a Kullback-Leibler divergence loss to promote consensual outputs.

One potential limitation of the approach is the reliance on four separate compositors, which may increase the computational complexity and training time of the system. It would be interesting to explore whether a similar level of performance can be achieved with a more efficient architecture.

Additionally, the paper does not provide much insight into the specific strengths and weaknesses of the individual compositors within the consensus module. Understanding how each compositor contributes to the final result could lead to further improvements in the system design.

Another area for further research could be investigating the generalizability of Css-Net to other domains beyond fashion, such as general composed image retrieval or composed video retrieval. Exploring the system's performance on diverse datasets and use cases could shed light on its broader applicability.

Conclusion

The Consensus Network (Css-Net) proposed in this paper represents a significant advancement in addressing the triplet ambiguity challenge in composed image retrieval systems. By leveraging a consensus module with diverse compositors and a Kullback-Leibler divergence loss, Css-Net demonstrates marked improvements in recall metrics, particularly on the FashionIQ dataset.

This research highlights the importance of addressing fundamental limitations in existing methods and the potential of collaborative approaches to enhance the robustness and performance of image-text retrieval systems. As the field of composed image retrieval continues to evolve, the insights and techniques introduced in this paper can serve as a valuable foundation for further advancements.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations

Xu Zhang, Zhedong Zheng, Linchao Zhu, Yi Yang

Composed image retrieval extends content-based image retrieval systems by enabling users to search using reference images and captions that describe their intention. Despite great progress in developing image-text compositors to extract discriminative visual-linguistic features, we identify a hitherto overlooked issue, triplet ambiguity, which impedes robust feature extraction. Triplet ambiguity refers to a type of semantic ambiguity that arises between the reference image, the relative caption, and the target image. It is mainly due to the limited representation of the annotated text, resulting in many noisy triplets where multiple visually dissimilar candidate images can be matched to an identical reference pair (i.e., a reference image + a relative caption). To address this challenge, we propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings, fostering complementary feature extraction and mitigating dependence on any single, potentially biased compositor; (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions to promote consensual outputs. During evaluation, the decisions of the four compositors are combined through a weighting scheme, enhancing overall agreement. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its competitiveness in addressing the fundamental limitations of existing methods.

9/4/2024

CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, Xueming Qian

Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In our paper, we identify two new relations within triplets, treating each triplet as a graph node. Firstly, we introduce the concept of text-bridged image alignment, where the query text serves as a bridge between the query image and the target image. We propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Secondly, we explore complementary text reasoning, considering CIR as a form of cross-modal retrieval where two images compose to reason about complementary text. To integrate these perspectives effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.

5/31/2024

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Zhangchi Feng, Richong Zhang, Zhijie Nie

The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modified text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, the triplet for CIR incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the negative number available for the model. To address the problem of lack of positives, we propose a data generation method by leveraging a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for the low-resources scenario. Our code and data are released at https://github.com/BUAADreamer/SPN4CIR.

8/9/2024

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Xuemeng Song

Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query, i.e., a reference image and its corresponding modification text. While previous supervised or zero-shot learning paradigms all fail to strike a good trade-off between time-consuming annotation cost and retrieval performance, recent researchers introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network based on pretrained CLIP model to realize it. Despite its promising performance, the approach suffers from two key limitations: insufficient multimodal query composition training and indiscriminative training triplet selection. To address these two limitations, in this work, we propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we employ a masked training strategy and advanced image caption generator to construct pseudo triplets from pure image data to enable the model to acquire primary knowledge related to multimodal query composition. In the second stage, based on active learning, we design a pseudo modification text-based query-target distance metric to evaluate the challenging score for each unlabeled sample. Meanwhile, we propose a robust top range-based random sampling strategy according to the 3-$\sigma$ rule in statistics, to sample the challenging samples for fine-tuning the pretrained model. Notably, our scheme is plug-and-play and compatible with any existing supervised CIR models. We tested our scheme across three backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 26.4%, 25.5% and 21.6% respectively, demonstrating our scheme's effectiveness.

7/9/2024