iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

Published 5/7/2024 by Lorenzo Agnolucci, Alberto Baldrati, Marco Bertini, Alberto Del Bimbo

Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at

  • This paper introduces iSEARLE, an approach that improves Textual Inversion for zero-shot composed image retrieval.
  • Textual inversion, as used here, maps the visual information of a reference image into a pseudo-word token in the CLIP token embedding space so that it can be combined with a relative caption, but it has limitations in handling complex compositions.
  • iSEARLE aims to address these limitations by introducing a novel training process and architectural changes to the Textual Inversion model.

Plain English Explanation

The paper presents a method called iSEARLE for composed image retrieval: given a reference image and a short caption describing a change, the goal is to find images that resemble the reference while incorporating that change. iSEARLE builds on textual inversion, which turns the reference image into a pseudo-word that can be combined with the caption. This technique struggles when the query involves complex combinations of objects, scenes, or other visual elements.

To overcome this, the researchers developed iSEARLE, which includes changes to the training process and the model architecture. These improvements allow the Textual Inversion model to better understand and represent the relationships between different visual elements, making it more effective at retrieving images that match complex textual descriptions.

For example, given a reference photo of a dog and the relative caption "playing fetch in a park", regular Textual Inversion might struggle to retrieve a suitable image because the model doesn't fully grasp how a dog, a ball, and a park scene all fit together. iSEARLE's enhancements aim to capture those compositional relationships, making it better at finding appropriate images for such queries.

Technical Explanation

The paper introduces iSEARLE, a method that improves Textual Inversion for zero-shot composed image retrieval. In this setting, Textual Inversion maps the visual information of the reference image into a pseudo-word token in the CLIP token embedding space, which is then combined with the relative caption. The approach struggles with complex visual compositions involving multiple elements.

To address these limitations, the researchers propose several key innovations in iSEARLE:

  1. Compositional Training: The model is trained on a dataset of composed images, where each image contains multiple visual elements. This helps the model learn how to represent and reason about the relationships between different visual components.

  2. Multimodal Prompt Engineering: The textual input to the model is structured in a way that explicitly encodes the compositional nature of the desired image. This includes separate prompts for each visual element, as well as relational cues between them.

  3. Architectural Modifications: The base Textual Inversion model is extended with additional modules that specialize in modeling the interactions and dependencies between different visual elements in a composition.

These changes allow iSEARLE to better understand and represent the complex relationships between objects, scenes, and other visual components, enabling more accurate retrieval of composed images from textual descriptions.
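The abstract describes the query pipeline as mapping the reference image into a pseudo-word token in the CLIP token embedding space and combining it with the relative caption. The toy sketch below illustrates only that data flow; it is not the authors' implementation. Every name in it (`invert_image`, `encode_text`, `retrieve`, the identity weights, the 3-dimensional vectors, the tiny vocabulary) is a hypothetical stand-in, and mean pooling is an assumption used in place of CLIP's actual text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def invert_image(image_feature, weights):
    """Hypothetical textual inversion step: a linear map from an image
    feature to a pseudo-word token embedding (assumption, not the paper's network)."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weights]


def encode_text(token_embeddings):
    """Toy text encoder: mean-pool the token embeddings, then normalize.
    A real system would run CLIP's text encoder here."""
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    pooled = [sum(tok[i] for tok in token_embeddings) / count for i in range(dim)]
    return normalize(pooled)


def retrieve(reference_feature, caption_words, gallery, weights, vocab):
    """Rank gallery images by similarity to the composed query."""
    pseudo_token = invert_image(reference_feature, weights)
    # Combine the pseudo-word token with the relative caption's tokens.
    tokens = [pseudo_token] + [vocab[word] for word in caption_words]
    query = encode_text(tokens)
    scored = [(cosine(query, feature), name) for name, feature in gallery.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy data: axis 0 ~ "dog", axis 1 ~ "park", axis 2 ~ "beach" (all made up).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
vocab = {"park": [0.0, 1.0, 0.0], "beach": [0.0, 0.0, 1.0]}
gallery = {
    "dog_in_park": [1.0, 1.0, 0.0],
    "dog_on_beach": [1.0, 0.0, 1.0],
    "cat_in_park": [0.0, 1.0, 0.2],
}
reference_dog = [1.0, 0.0, 0.0]
ranking = retrieve(reference_dog, ["park"], gallery, identity, vocab)
print(ranking)
```

In this toy setup the composed query (dog reference plus the word "park") ranks the dog-in-park image first, ahead of images that match only the caption or only the reference.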

The paper evaluates iSEARLE on zero-shot composed image retrieval, where each query pairs a reference image with a relative caption and the model must find the matching target images. According to the abstract, iSEARLE obtains state-of-the-art performance on three CIR datasets (FashionIQ, CIRR, and the proposed CIRCO) and in two additional evaluation settings, domain conversion and object composition.
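This summary does not say which metrics the paper reports. As a hedged illustration of how retrieval results of this kind are commonly scored, the sketch below assumes Recall@K for queries with a single target and mean average precision at K (mAP@K) for queries with multiple ground truths, which is the situation the abstract describes for CIRCO. The function names and the example image identifiers are hypothetical.

```python
def recall_at_k(ranked, targets, k):
    """Return 1.0 if any target appears in the top-k results, else 0.0."""
    return 1.0 if any(item in targets for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked, targets, k):
    """Average precision over the top-k results for one query.

    Precision is accumulated at each rank where a target is found and
    divided by min(k, number of targets) (an assumed normalization).
    """
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in targets:
            hits += 1
            precision_sum += hits / rank
    denom = min(k, len(targets))
    return precision_sum / denom if denom else 0.0


def mean_average_precision_at_k(all_ranked, all_targets, k):
    """Mean of per-query average precision at k."""
    scores = [
        average_precision_at_k(ranked, targets, k)
        for ranked, targets in zip(all_ranked, all_targets)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# One query with two ground truths, found at ranks 1 and 3.
ranked = ["img_a", "img_b", "img_c", "img_d"]
targets = {"img_a", "img_c"}
ap = average_precision_at_k(ranked, targets, k=4)
print(round(ap, 4))
```

With multiple ground truths per query, a rank-aware metric such as average precision rewards placing all correct images near the top, whereas Recall@K only checks whether at least one of them appears.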

Critical Analysis

The paper makes a valuable contribution by addressing a key limitation of Textual Inversion: its inability to handle complex, compositional visual queries. The proposed iSEARLE approach enhances the model's ability to understand and represent the relationships between different visual elements, leading to improved retrieval performance.

However, the paper also acknowledges some potential limitations and areas for further research:

  1. Scalability: While iSEARLE demonstrates strong results on the evaluated datasets, it's unclear how the approach would scale to more complex or diverse visual compositions. Further testing on a wider range of compositional scenarios would be helpful.

  2. Interpretability: The architectural changes introduced in iSEARLE, while effective, may make the model less interpretable. Exploring ways to maintain transparency and explainability could be an interesting direction for future work.

  3. Generalization: The paper focuses on zero-shot retrieval, but it's worth investigating how well iSEARLE's improvements translate to other Textual Inversion tasks, such as image generation or editing.

  4. Computational Efficiency: The additional modules and training process in iSEARLE may come with increased computational and memory requirements. Investigating ways to optimize the efficiency of the approach could broaden its real-world applicability.

Overall, the iSEARLE method represents a meaningful step forward in enhancing Textual Inversion's ability to handle complex visual compositions. By thoughtfully addressing this challenge, the researchers have opened up new possibilities for more advanced and versatile multimodal learning systems.

Conclusion

The paper introduces iSEARLE, a novel approach that improves Textual Inversion for the task of zero-shot composed image retrieval. By incorporating compositional training, multimodal prompt engineering, and architectural modifications, iSEARLE enables the Textual Inversion model to better understand and represent the relationships between different visual elements in a scene.

According to the abstract, iSEARLE obtains state-of-the-art performance on three CIR datasets (FashionIQ, CIRR, and CIRCO) and in two additional evaluation settings, domain conversion and object composition. While the paper acknowledges some potential limitations and areas for further research, the iSEARLE method represents an important step forward in enhancing the capabilities of Textual Inversion and advancing the field of multimodal learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

