CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Read original: arXiv:2303.11916 - Published 7/17/2024 by Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

Overview

The paper proposes a novel diffusion-based model called CompoDiff for solving zero-shot Composed Image Retrieval (ZS-CIR) tasks.
It also introduces a new synthetic dataset, SynthTriplets18M, containing 18.8 million triplets, each made up of a reference image, a condition, and a corresponding target image, for training CIR models.
CompoDiff and SynthTriplets18M aim to address the limitations of previous CIR approaches, such as poor generalizability due to small dataset scale and limited condition types.

Plain English Explanation

The paper introduces a new model called CompoDiff that uses a diffusion-based approach to solve a specific type of image retrieval task called zero-shot Composed Image Retrieval (ZS-CIR). In this task, the goal is to find an image that matches a given "recipe" or combination of text and image conditions.

For example, the input could be a description like "a sunny landscape with a red barn" along with a reference image of a landscape. The model needs to then retrieve an image from a database that matches this combination of text and visual elements.

Previous approaches to this task have been held back by small datasets and by support for only a narrow range of condition types. To address this, the researchers created a large synthetic dataset called SynthTriplets18M, which contains 18.8 million example triplets of reference images, conditions, and target images.

CompoDiff and SynthTriplets18M are designed to improve the generalizability and versatility of CIR models. The paper shows that CompoDiff outperforms previous state-of-the-art methods on several ZS-CIR benchmarks. It also demonstrates new capabilities, like accepting negative text conditions and image masks as input, and allowing control over the relative importance of text and image in the query.

Technical Explanation

The key technical contributions of this paper are the CompoDiff model and the SynthTriplets18M dataset.

CompoDiff is a diffusion-based model for the ZS-CIR task. Diffusion models start from random noise and iteratively refine it into a sample; they are best known for generating realistic images. As the paper's title indicates, CompoDiff applies this process in a latent space and uses the result to retrieve database images that match the given text and image conditions.
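To make the idea of retrieving by denoising concrete, here is a minimal sketch. It is not the authors' implementation. It assumes images are represented as small embedding vectors, replaces the learned denoiser with a hypothetical `toy_denoiser` that simply returns the condition vector, and ranks database embeddings by cosine similarity. All names and numbers are illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_denoiser(latent, condition):
    """Hypothetical stand-in for a learned denoiser: predicts the clean latent.

    A real model would be a neural network conditioned on text and image
    features; here the prediction is just a copy of the condition vector.
    """
    return list(condition)


def denoise(condition, steps=10, seed=0):
    """Start from Gaussian noise and move step by step toward the prediction."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in condition]
    for step in range(steps):
        predicted = toy_denoiser(latent, condition)
        # The blend weight grows so that the final step lands on the prediction.
        alpha = (step + 1) / steps
        latent = [(1 - alpha) * z + alpha * p for z, p in zip(latent, predicted)]
    return latent


def retrieve(query_latent, database, top_k=1):
    """Rank database entries by cosine similarity to the denoised latent."""
    ranked = sorted(
        database,
        key=lambda name: cosine(query_latent, database[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Illustrative 3-dimensional "embeddings" (made-up values).
database = {
    "red_barn": [0.9, 0.1, 0.0],
    "blue_lake": [0.0, 0.2, 0.9],
    "green_hill": [0.1, 0.9, 0.1],
}
condition = [1.0, 0.2, 0.0]  # pretend this encodes "reference image + text"
result = retrieve(denoise(condition), database)
```

In this toy setup the denoised latent ends up closest to the `red_barn` entry, so that is what gets retrieved. The point of the sketch is the two-stage shape (denoise a latent, then do nearest-neighbor search), not the details of either stage.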

The model accepts various types of conditions, including text descriptions, reference images, negative text, and image masks. It learns to balance the influence of these different inputs when retrieving the desired target image.
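One common way to weight several conditions in a diffusion model is classifier-free guidance with a separate scale per condition. The sketch below shows that general technique on plain lists of numbers; whether CompoDiff combines its conditions in exactly this form is an assumption here, and the function and parameter names (`guided_prediction`, `w_image`, `w_text`) are hypothetical.

```python
def guided_prediction(pred_uncond, pred_image, pred_full, w_image, w_text):
    """Combine three denoiser outputs with separate image and text weights.

    pred_uncond: prediction with no conditions
    pred_image:  prediction conditioned on the reference image only
    pred_full:   prediction conditioned on reference image and text
    """
    return [
        u + w_image * (i - u) + w_text * (f - i)
        for u, i, f in zip(pred_uncond, pred_image, pred_full)
    ]


# Made-up two-dimensional predictions for illustration.
uncond = [0.0, 0.0]
image_only = [1.0, 0.0]
full = [1.0, 1.0]

balanced = guided_prediction(uncond, image_only, full, w_image=1.0, w_text=1.0)
text_heavy = guided_prediction(uncond, image_only, full, w_image=1.0, w_text=3.0)
image_heavy = guided_prediction(uncond, image_only, full, w_image=3.0, w_text=1.0)
```

Raising `w_text` pushes the result further along the direction contributed by the text, and raising `w_image` does the same for the reference image. A negative text could plausibly be supported by feeding it where the empty condition would otherwise go, so that guidance moves away from it; that is also an assumption of this sketch, not a statement about the paper's method.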

SynthTriplets18M is a large-scale synthetic dataset created by the researchers to train and evaluate CIR models. It contains 18.8 million triplets of reference images, conditions, and target images. This dataset is orders of magnitude larger than previous CIR datasets, allowing models to learn more robust and generalizable representations.
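As a rough illustration of what one triplet looks like as a data record, here is a minimal sketch. The field names and example values are hypothetical; the actual schema of the released dataset may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One training example: reference image, condition, target image."""

    reference_image: str  # path or identifier of the reference image
    condition: str        # text describing the desired change
    target_image: str     # path or identifier of the image to retrieve


def to_query_and_label(triplet):
    """Split a triplet into the model input (query) and the expected answer."""
    query = (triplet.reference_image, triplet.condition)
    return query, triplet.target_image


# Made-up example values; not taken from the dataset.
example = Triplet("ref_0001.jpg", "make the barn red", "tgt_0001.jpg")
query, label = to_query_and_label(example)
```

During training, the reference image and condition together form the query, and the target image is what the model should rank first.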

The paper demonstrates that CompoDiff achieves state-of-the-art performance on four ZS-CIR benchmarks: FashionIQ, CIRR, CIRCO, and GeneCIS. It also shows that CompoDiff enables new capabilities, such as controlling the relative importance of text and image conditions, and handling negative text conditions.
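Retrieval benchmarks of this kind are commonly scored with ranking metrics such as Recall@K: the fraction of queries whose correct target appears among the top K retrieved items. The summary does not say which metric each benchmark uses, so the following is only a generic sketch with made-up data, assuming exactly one correct target per query.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears in the top-k retrieved items."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# One ranked result list per query (best match first), plus the true target.
ranked_lists = [
    ["img_a", "img_b", "img_c"],
    ["img_c", "img_a", "img_b"],
    ["img_b", "img_c", "img_a"],
    ["img_a", "img_c", "img_b"],
]
targets = ["img_a", "img_a", "img_a", "img_b"]

r1 = recall_at_k(ranked_lists, targets, k=1)  # 0.25: only the first query hits
r2 = recall_at_k(ranked_lists, targets, k=2)  # 0.5: the second query also hits
```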

Critical Analysis

The paper makes a compelling case for the advantages of CompoDiff and the SynthTriplets18M dataset. However, there are a few potential limitations and areas for further research:

The synthetic nature of the SynthTriplets18M dataset, while enabling large-scale training, may limit the model's ability to generalize to real-world data. Further testing on diverse, real-world CIR datasets would be valuable.
The paper does not provide a detailed analysis of the computational and memory requirements of CompoDiff, which could be an important practical consideration for deploying the model in production environments.
While the paper highlights the controllability of CompoDiff, it would be interesting to explore the model's robustness to noisy or adversarial conditions, and its ability to handle open-ended, freeform text descriptions.
Other recent CIR approaches, such as Pseudo-Triplet Guided Few-Shot Composed Image Retrieval and HyCIR, could be interesting points of comparison for CompoDiff.

Overall, the paper presents a significant advance in diffusion-based composed image retrieval and opens up new possibilities for more versatile and controllable retrieval tasks.

Conclusion

The proposed CompoDiff model and SynthTriplets18M dataset represent an important step forward in solving the challenging problem of zero-shot Composed Image Retrieval (ZS-CIR). By leveraging diffusion-based techniques and a large-scale synthetic dataset, the researchers have developed a versatile and high-performing system that can handle a variety of text and image conditions.

The implications of this work are significant, as CIR has numerous real-world applications, such as product search, image editing, and creative ideation. The improved generalizability and controllability of CompoDiff could lead to more robust and user-friendly image retrieval systems that better understand and respond to complex user requests.

As the field of AI continues to advance, research like this that pushes the boundaries of what's possible in multimodal perception and generation will be crucial for developing more intelligent and helpful technologies.

Related Papers

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text, and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff

7/17/2024

RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

Xinchen Zhang, Ling Yang, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui

Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: https://github.com/YangLing0818/RealCompo

6/5/2024

Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Zhangchi Feng, Richong Zhang, Zhijie Nie

The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modified text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, the triplet for CIR incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the negative number available for the model. To address the problem of lack of positives, we propose a data generation method by leveraging a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for the low-resources scenario. Our code and data are released at https://github.com/BUAADreamer/SPN4CIR.

8/9/2024

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Zhekai Chen, Wen Wang, Zhen Yang, Zeqing Yuan, Hao Chen, Chunhua Shen

We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. Rather than concentrating on specific use cases such as appearance editing (image harmonization) or semantic editing (semantic image composition), we showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition applicable to both scenarios. We observe that the pre-trained diffusion models automatically identify simple copy-paste boundary areas as low-density regions during denoising. Building on this insight, we propose to optimize the composed image towards high-density regions guided by the diffusion prior. In addition, we introduce a novel mask-guided loss to further enable flexible semantic image composition. Extensive experiments validate the superiority of our approach in achieving generic zero-shot image composition. Additionally, our approach shows promising potential in various tasks, such as object removal and multi-concept customization.

7/9/2024