Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Read original: arXiv:2406.18836 - Published 6/28/2024 by Huaying Zhang, Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama

Overview

This paper presents an approach to zero-shot composed image retrieval (zero-shot CIR) that takes the relationship between the query and the target image into account.
The proposed method uses masked image-text pairs to learn this query-target relationship and improve retrieval performance.
In CIR, the query consists of a query image and a query text, and the goal is to retrieve the matching target image; the zero-shot setting does this without training on labeled CIR examples.

Plain English Explanation

The paper discusses a new way to search for images using a two-part query, a reference image plus a piece of text, without having labeled examples of such searches to train on. This is called "zero-shot composed image retrieval" (zero-shot CIR).

The key idea is to use masked image-text pairs, that is, ordinary image-caption pairs to which a masking step has been applied, to learn the relationship between the query (the image and text you supply) and the target image (the image you want to find). By understanding this relationship, the system can better match queries to relevant images, even if it has never seen that exact query before.

For example, you might supply a photo of a person riding a bike together with the text "on a sunny day". The system has to combine what the photo shows with what the text asks for, and then find the image that fits both. By learning from masked image-text pairs how such a query relates to its target, the system can make better guesses about which images match, even if it has never seen that exact combination before.

The researchers show that this approach of considering the query-target relationship leads to better performance on zero-shot CIR tasks compared to previous methods. It's an interesting step forward in making image search more flexible and powerful, even when dealing with novel, composed queries.

Technical Explanation

The paper proposes a novel zero-shot CIR method that models the relationship between the query and target images to improve retrieval performance. The key technical contributions include:

Masked Image-Text Pairs: The authors build training data from ordinary image-text pairs, which are abundant and easy to obtain, by applying a masking strategy. The resulting masked pairs stand in for the labeled CIR examples that the zero-shot setting does not have.
Query-Target Relationship Learning: Existing methods train a textual inversion network, which converts the query image into a pseudo word, without considering the retrieval target. Here the model is trained end-to-end on the masked pairs so that the textual inversion network picks up information that is useful for retrieval.
Zero-shot CIR Architecture: The pseudo word produced from the query image is composed with the query text, and a pre-trained vision-language model is used to carry out the retrieval. Candidate images are ranked by their relevance to the composed query.
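
The retrieval step described in the paper's abstract (a query image converted to a pseudo word, composed with the query text, then matched against candidate images) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 3-dimensional embeddings, the linear `textual_inversion` map, and the mean-pooling `compose_query` stand in for a real textual inversion network and a pre-trained vision-language model, and none of the names come from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def textual_inversion(image_emb, weights):
    """Hypothetical linear map from an image embedding to a pseudo-word embedding."""
    return [sum(w * x for w, x in zip(row, image_emb)) for row in weights]

def compose_query(pseudo_word, text_token_embs):
    """Toy text encoder: average the pseudo word with the query-text token embeddings."""
    tokens = [pseudo_word] + text_token_embs
    dim = len(pseudo_word)
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return normalize(mean)

def rank_gallery(query_emb, gallery):
    """Return gallery ids sorted by cosine similarity to the composed query, best first."""
    scores = {
        name: sum(q * g for q, g in zip(query_emb, normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings; the three axes loosely stand for (dog, beach, snow).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_image_emb = [1.0, 0.0, 0.0]       # query image: a dog
query_text_tokens = [[0.0, 1.0, 0.0]]   # query text: "on a beach"
gallery = {
    "dog_on_beach": [1.0, 1.0, 0.0],
    "dog_in_snow": [1.0, 0.0, 1.0],
    "empty_beach": [0.0, 1.0, 0.0],
}

pseudo_word = textual_inversion(query_image_emb, identity)
query_emb = compose_query(pseudo_word, query_text_tokens)
ranking = rank_gallery(query_emb, gallery)
print(ranking)
```

In this toy setup the image that matches both the query image and the query text ranks first. A real system would replace the hand-written vectors with encoder outputs and the identity matrix with learned weights.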

The paper evaluates the proposed approach on standard zero-shot CIR benchmarks and demonstrates significant performance improvements over state-of-the-art methods. The authors attribute these gains to the model's ability to effectively capture and leverage the query-target relationship.
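
This summary does not say which metric the evaluation uses. Recall@K is the measure commonly reported on CIR benchmarks such as FashionIQ and CIRR, so the sketch below assumes it; the query ids, image ids, and rankings are made-up illustration data, not results from the paper.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose ground-truth target appears in the top-k results."""
    hits = sum(1 for qid, ranked in rankings.items() if targets[qid] in ranked[:k])
    return hits / len(rankings)

# Hypothetical retrieval output: for each query, gallery ids ordered best first.
rankings = {
    "q1": ["img_a", "img_b", "img_c"],
    "q2": ["img_c", "img_a", "img_b"],
    "q3": ["img_b", "img_c", "img_a"],
}
# Hypothetical ground truth: one target image per query.
targets = {"q1": "img_a", "q2": "img_a", "q3": "img_a"}

r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r3 = recall_at_k(rankings, targets, 3)
print(r1, r2, r3)
```

Recall@K can only stay the same or grow as K increases, which is why papers usually report it at several cutoffs.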

Critical Analysis

The paper presents a compelling approach to zero-shot CIR that addresses an important limitation of previous methods: they do not take the relationship between the query and the target image into account. By incorporating this relationship learning, the proposed approach can better match composed queries to relevant images.

However, the paper does not extensively discuss the limitations or potential drawbacks of the method. For example, the reliance on masked image-text pairs may introduce biases or assumptions that could impact the model's performance in real-world scenarios. Additionally, the paper does not explore the interpretability or explainability of the learned query-target relationships, which could be valuable for understanding the model's decision-making process.

Further research could investigate the generalization of the approach to more diverse datasets, as well as explore ways to make the relationship learning more transparent and controllable. Incorporating human evaluation or feedback loops could also help ensure the relevance and usefulness of the retrieved images for end-users.

Conclusion

This paper presents a novel zero-shot CIR method that considers the relationship between the query and target images, leveraging masked image-text pairs to learn this relationship and improve retrieval performance. The proposed approach demonstrates significant improvements over state-of-the-art methods, highlighting the importance of modeling the connection between the query and the target in zero-shot composed image retrieval.

The work contributes to the ongoing efforts to make image search more flexible and powerful, allowing users to find relevant images for complex, novel queries without the need for extensive training data. As the field of zero-shot learning continues to evolve, this research offers a promising direction for enhancing the robustness and effectiveness of compositional image retrieval systems.

Related Papers

Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

Huaying Zhang, Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama

This paper proposes a novel zero-shot composed image retrieval (CIR) method considering the query-target relationship by masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text and use a pre-trained visual-language model to realize the retrieval. However, they do not consider the query-target relationship to train the textual inversion network to acquire information for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting the abundant image-text pairs that are convenient to obtain with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused textual inversion network can be realized. Experimental results show the effectiveness of the proposed method.
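
The abstract says the method is trained end-to-end on masked image-text pairs but does not state the training objective. One plausible reading, sketched here purely as an assumption, is a contrastive (InfoNCE-style) loss in which each composed query embedding should score highest against its own target embedding within a batch. The function name, the temperature value, and the toy 2-dimensional embeddings are all hypothetical.

```python
import math

def info_nce(query_embs, target_embs, temperature=0.07):
    """Mean cross-entropy over a batch where query i should match target i."""
    losses = []
    for i, q in enumerate(query_embs):
        # Similarity of this query to every target in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in target_embs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)

# Toy unit-length embeddings for a batch of two query-target pairs.
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned_queries = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own target
swapped_queries = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong target

loss_aligned = info_nce(aligned_queries, targets)
loss_swapped = info_nce(swapped_queries, targets)
print(loss_aligned < loss_swapped)
```

The loss is near zero when queries line up with their targets and large when they do not, which is the signal that would push a textual inversion network toward retrieval-relevant information.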

6/28/2024

HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels

Yingying Jiang, Hanchao Jia, Xiaobing Wang, Peng Hao

Composed Image Retrieval (CIR) aims to retrieve images based on a query image with text. Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets. However, the gap between ZS-CIR and triplet-supervised CIR is still large. In this work, we propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR. A new label Synthesis pipeline for CIR (SynCir) is proposed, in which only unlabeled images are required. First, image pairs are extracted based on visual similarity. Second, query text is generated for each image pair based on vision-language model and LLM. Third, the data is further filtered in language space based on semantic similarity. To improve ZS-CIR performance, we propose a hybrid training strategy to work with both ZS-CIR supervision and synthetic CIR triplets. Two kinds of contrastive learning are adopted. One is to use large-scale unlabeled image dataset to learn an image-to-text mapping with good generalization. The other is to use synthetic CIR triplets to learn a better mapping for CIR tasks. Our approach achieves SOTA zero-shot performance on the common CIR benchmarks: CIRR and CIRCO.

7/10/2024

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

Lorenzo Agnolucci, Alberto Baldrati, Marco Bertini, Alberto Del Bimbo

Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.

5/7/2024

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Xuemeng Song

Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query, i.e., a reference image and its corresponding modification text. While previous supervised or zero-shot learning paradigms all fail to strike a good trade-off between time-consuming annotation cost and retrieval performance, recent researchers introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network based on pretrained CLIP model to realize it. Despite its promising performance, the approach suffers from two key limitations: insufficient multimodal query composition training and indiscriminative training triplet selection. To address these two limitations, in this work, we propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we employ a masked training strategy and advanced image caption generator to construct pseudo triplets from pure image data to enable the model to acquire primary knowledge related to multimodal query composition. In the second stage, based on active learning, we design a pseudo modification text-based query-target distance metric to evaluate the challenging score for each unlabeled sample. Meanwhile, we propose a robust top range-based random sampling strategy according to the 3-$\sigma$ rule in statistics, to sample the challenging samples for fine-tuning the pretrained model. Notably, our scheme is plug-and-play and compatible with any existing supervised CIR models. We tested our scheme across three backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 26.4%, 25.5% and 21.6% respectively, demonstrating our scheme's effectiveness.

7/9/2024