Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Read original: arXiv:2404.11317 - Published 8/9/2024 by Zhangchi Feng, Richong Zhang, Zhijie Nie

Overview

This paper proposes a way to improve composed image retrieval (CIR), where the query is a reference image plus a text describing a modification, and the goal is to retrieve the matching target image.
The key idea is to scale up the positive and negative examples used in contrastive learning: positives through triplets generated by a multi-modal large language model, and negatives through a two-stage fine-tuning framework.
The authors report state-of-the-art results on the FashionIQ and CIRR datasets and say the method also performs well in zero-shot CIR.

Plain English Explanation

The paper focuses on the task of composed image retrieval (CIR). The query has two parts: a reference image and a short text describing how the result should differ from it. For example, given a photo of a dress and the text "the same style but in red with shorter sleeves", the system should return an image of such a dress. This is challenging because the model must combine information from the image and the text into a single query representation.

Many CIR models are trained with contrastive learning, a technique that pulls the representation of each query toward its correct target image (the positive) and pushes it away from incorrect images (the negatives). Contrastive learning works best with plenty of both, and CIR has a shortage of each. Positives come from hand-annotated triplets (reference image, modification text, target image), which are expensive to produce. Negatives are usually just the other images that happen to be in the same training batch. The paper addresses both problems: it uses a multi-modal large language model to generate additional triplets, and it adds a second fine-tuning stage that exposes the model to many more negatives.

According to the authors, experiments on the FashionIQ and CIRR datasets show that the approach achieves state-of-the-art results, and both improvements can be added to existing CIR models without changing their architectures. This suggests that increasing the number of positives and negatives is an effective way to improve contrastive training for this task.

Technical Explanation

The paper proposes an approach for composed image retrieval (CIR) called Scaling Positives and Negatives (SPN). In CIR, the query is composed of a reference image and a modification text, and the model must retrieve the target image. The core idea is to scale up the number of positive and negative examples available to the contrastive objective, rather than to change the model architecture.

Specifically, the authors start from the observation that contrastive learning benefits from adequate positive and negative examples, and that CIR is short of both. Training triplets (reference image, modification text, target image) carry high manual annotation costs, so positives are limited. Existing methods also commonly use in-batch negative sampling, which ties the number of negatives each query sees to the batch size.
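To make the contrastive objective concrete, here is a minimal NumPy sketch of an InfoNCE-style loss with in-batch negatives. It is an illustration under assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of cosine similarity are all choices made for this example. Row i of the query embeddings is treated as matching row i of the target embeddings, and every other row in the batch acts as a negative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.07):
    """InfoNCE-style loss with in-batch negatives (hypothetical helper).

    query_emb:  (B, D) embeddings of composed queries (reference image + text).
    target_emb: (B, D) embeddings of target images; row i is the positive for query i.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(target_emb)
    logits = q @ t.T / temperature                        # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.normal(size=(8, 16))
    queries = targets + 0.1 * rng.normal(size=(8, 16))    # queries close to their targets
    print("loss:", in_batch_contrastive_loss(queries, targets))
```

With a batch of B examples, each query sees only B - 1 negatives, which is why the number of negatives is tied to the batch size under in-batch sampling.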

To add positives, the authors propose a data generation method that uses a multi-modal large language model to construct new CIR triplets. To add negatives, they design a two-stage fine-tuning framework whose second stage introduces a large number of static representations of negatives, with the aim of optimizing the representation space rapidly. Both improvements are designed to be plug-and-play: they can be stacked and applied to existing CIR models without changing their original architectures.
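The paper's abstract describes a second fine-tuning stage that introduces many static representations of negatives. The sketch below shows one way such a stage could compute its loss, assuming the candidate embeddings are precomputed once and held fixed while only the query side is trained. That reading, along with the names `static_negative_loss`, `bank_emb`, and `target_idx`, is an assumption made for illustration; the abstract does not give these details.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def static_negative_loss(query_emb, bank_emb, target_idx, temperature=0.07):
    """Contrastive loss of each query against a fixed bank of candidates (hypothetical).

    query_emb:  (B, D) embeddings of the composed queries in the current batch.
    bank_emb:   (N, D) candidate embeddings, precomputed and not updated
                (the "static" part, as assumed in this sketch).
    target_idx: (B,) index in the bank of each query's positive target.
    """
    q = l2_normalize(query_emb)
    bank = l2_normalize(bank_emb)
    logits = q @ bank.T / temperature                     # (B, N): N - 1 negatives per query
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(len(target_idx))
    return float(-np.mean(log_prob[rows, target_idx]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(1000, 16))                    # toy stand-in for cached embeddings
    idx = np.arange(8)                                    # positives for a batch of 8 queries
    queries = bank[idx] + 0.1 * rng.normal(size=(8, 16))
    print("loss against 999 negatives per query:", static_negative_loss(queries, bank, idx))
```

In this sketch, a bank of N candidates gives each query N - 1 negatives regardless of batch size, and because the bank is not recomputed, the extra negatives cost only one matrix multiplication per step.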

The proposed SPN approach is evaluated on the FashionIQ and CIRR datasets, where the authors report state-of-the-art results, supported by ablation analysis. They also report that the method performs well in zero-shot composed image retrieval, which makes it a candidate solution for low-resource scenarios. Code and data are released at https://github.com/BUAADreamer/SPN4CIR.
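Retrieval benchmarks such as FashionIQ and CIRR, the two datasets named in the paper's abstract, are typically reported with Recall@K: the fraction of queries whose correct target appears among the top K retrieved candidates. This summary does not state the exact metrics used, so the following is a generic sketch with a hypothetical function name and toy data rather than the paper's evaluation code.

```python
import numpy as np

def recall_at_k(similarity, target_idx, k):
    """Fraction of queries whose target is among the k highest-scoring candidates.

    similarity: (Q, N) score of every candidate for every query (higher is better).
    target_idx: (Q,) index of the correct candidate for each query.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]        # best k candidates per query
    hits = (top_k == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    # Toy scores for 3 queries over 3 candidates (made-up numbers).
    similarity = np.array([
        [0.9, 0.1, 0.3],   # target 0 is ranked first
        [0.2, 0.4, 0.8],   # target 1 is ranked second
        [0.5, 0.6, 0.1],   # target 1 is ranked first
    ])
    targets = [0, 1, 1]
    print("Recall@1:", recall_at_k(similarity, targets, 1))  # 2 of 3 queries hit
    print("Recall@2:", recall_at_k(similarity, targets, 2))  # all 3 queries hit
```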

Critical Analysis

The paper targets a concrete bottleneck in CIR training, the shortage of positive and negative examples, and addresses each side with a separate mechanism: generated triplets for positives and a second fine-tuning stage for negatives. Because both mechanisms are designed to be plug-and-play, they can be applied to existing CIR models, and the authors report ablation analysis alongside the main experiments.

One potential limitation of the work is that it primarily focuses on improving retrieval performance, without examining the interpretability of the learned representations in depth. It would be interesting to see how the additional positives and negatives change what the model learns about the relationship between the reference image, the modification text, and the target image.

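Additionally, the practical cost of the approach deserves attention. Generating triplets with a multi-modal large language model and running a second fine-tuning stage both add work beyond standard training, even though the abstract describes the second stage as optimizing the representation space rapidly. A clear comparison of training cost against prior methods would help in judging how practical the technique is to deploy in real-world applications.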

Overall, the paper makes a valuable contribution to the field of composed image retrieval and provides a strong foundation for further research in this area. The authors' insights on the importance of scaling positive and negative examples during contrastive learning could also have broader implications for other retrieval tasks trained with contrastive objectives.

Conclusion

This paper presents Scaling Positives and Negatives (SPN), an approach for improving composed image retrieval. The key idea is to give contrastive learning more examples to work with: additional positive triplets generated by a multi-modal large language model, and many more negatives introduced as static representations in a second fine-tuning stage.

Through experiments on the FashionIQ and CIRR datasets, the authors report that SPN achieves state-of-the-art results for composed image retrieval and also performs well in the zero-shot setting. This suggests that scaling positive and negative examples is an effective way to learn better representations for this task.

The insights from this work could carry over to other retrieval problems that rely on contrastive learning and have limited annotated data. Future research could explore the interpretability of the learned representations and the computational cost of the approach in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Zhangchi Feng, Richong Zhang, Zhijie Nie

The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modified text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, the triplet for CIR incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the negative number available for the model. To address the problem of lack of positives, we propose a data generation method by leveraging a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for the low-resources scenario. Our code and data are released at https://github.com/BUAADreamer/SPN4CIR.

8/9/2024

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Xuemeng Song

Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query, i.e., a reference image and its corresponding modification text. While previous supervised or zero-shot learning paradigms all fail to strike a good trade-off between time-consuming annotation cost and retrieval performance, recent researchers introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network based on pretrained CLIP model to realize it. Despite its promising performance, the approach suffers from two key limitations: insufficient multimodal query composition training and indiscriminative training triplet selection. To address these two limitations, in this work, we propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we employ a masked training strategy and advanced image caption generator to construct pseudo triplets from pure image data to enable the model to acquire primary knowledge related to multimodal query composition. In the second stage, based on active learning, we design a pseudo modification text-based query-target distance metric to evaluate the challenging score for each unlabeled sample. Meanwhile, we propose a robust top range-based random sampling strategy according to the 3-$\sigma$ rule in statistics, to sample the challenging samples for fine-tuning the pretrained model. Notably, our scheme is plug-and-play and compatible with any existing supervised CIR models. We tested our scheme across three backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 26.4%, 25.5% and 21.6% respectively, demonstrating our scheme's effectiveness.

7/9/2024

HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels

Yingying Jiang, Hanchao Jia, Xiaobing Wang, Peng Hao

Composed Image Retrieval (CIR) aims to retrieve images based on a query image with text. Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets. However, the gap between ZS-CIR and triplet-supervised CIR is still large. In this work, we propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR. A new label Synthesis pipeline for CIR (SynCir) is proposed, in which only unlabeled images are required. First, image pairs are extracted based on visual similarity. Second, query text is generated for each image pair based on vision-language model and LLM. Third, the data is further filtered in language space based on semantic similarity. To improve ZS-CIR performance, we propose a hybrid training strategy to work with both ZS-CIR supervision and synthetic CIR triplets. Two kinds of contrastive learning are adopted. One is to use large-scale unlabeled image dataset to learn an image-to-text mapping with good generalization. The other is to use synthetic CIR triplets to learn a better mapping for CIR tasks. Our approach achieves SOTA zero-shot performance on the common CIR benchmarks: CIRR and CIRCO.

7/10/2024

CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, Xueming Qian

Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In our paper, we identify two new relations within triplets, treating each triplet as a graph node. Firstly, we introduce the concept of text-bridged image alignment, where the query text serves as a bridge between the query image and the target image. We propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Secondly, we explore complementary text reasoning, considering CIR as a form of cross-modal retrieval where two images compose to reason about complementary text. To integrate these perspectives effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.

5/31/2024