ComCLIP: Training-Free Compositional Image and Text Matching






Published 4/16/2024 by Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang



Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching -- a more challenging image and text matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embedding and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets (SVO, ComVG, Winoground, and VL-checklist) and two general image-text retrieval datasets (Flickr30K and MSCOCO) demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our codes can be found at


  • The paper proposes a novel "training-free" model called ComCLIP that aims to improve the compositional generalization of vision-language models like CLIP.
  • ComCLIP disentangles input images into subjects, objects, and actions, and then composes the CLIP vision and text encoders to perform matching over the compositional sub-image and text embeddings.
  • This approach aims to mitigate spurious correlations in the pretrained CLIP model and dynamically evaluate the importance of each visual component.
  • Experiments show ComCLIP can boost the zero-shot performance of CLIP, SLIP, and BLIP2 on compositional image-text matching tasks.

Plain English Explanation

The paper focuses on a key challenge in vision-language models like CLIP: understanding the compositional relationship between words and visual elements. While these models can generally match images and text well, they can struggle when the task requires deeper comprehension of how different components in an image relate to the words describing it.

To address this, the researchers propose a new model called ComCLIP. Instead of just looking at the full image, ComCLIP breaks it down into separate parts - the subject, the object, and the action. It then uses the CLIP model's vision and text encoders to analyze how these different visual components match up with the compositional meaning of the text description.

By disentangling the image in this way, ComCLIP can better understand the underlying relationships between the words and visual elements, rather than just relying on superficial correlations that the original CLIP model may have learned. This allows ComCLIP to perform better on tasks that require more sophisticated compositional understanding, like matching images to complex text descriptions.

The key insight is that the errors made by the original CLIP model are often caused by "confounding factors" - aspects of the image that are spuriously correlated with the text, but don't actually capture the true semantics. By explicitly modeling the different components of the image and text, ComCLIP can overcome these issues and achieve better zero-shot performance on challenging compositional tasks.

Technical Explanation

The core idea behind ComCLIP is to disentangle the input image into separate sub-images representing the subject, object, and action, and then use the CLIP vision and text encoders to perform matching between the compositional text embedding and the sub-image embeddings.

Specifically, ComCLIP first uses off-the-shelf object detection and action recognition models to identify the key visual components in the input image. It then extracts sub-images corresponding to the detected subject, object, and action. These sub-images are then passed through the CLIP vision encoder to obtain their respective embeddings.
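The cropping step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the toy image is a nested list instead of a real picture, the boxes are hypothetical detector outputs, and treating the action region as the union of the subject and object boxes is an assumption made only for this example.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom are exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the region of `image` covered by `box`, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def extract_sub_images(image: Image, subject_box: Box, object_box: Box) -> Dict[str, Image]:
    """Cut out one sub-image per role from hypothetical detector boxes."""
    return {
        "subject": crop(image, subject_box),
        "object": crop(image, object_box),
        # Assumption: the action is approximated by the region spanning both entities.
        "action": crop(image, union(subject_box, object_box)),
    }


# Hypothetical 4x6 image and hypothetical detector output.
image = [[10 * r + c for c in range(6)] for r in range(4)]
subs = extract_sub_images(image, subject_box=(0, 0, 2, 2), object_box=(3, 1, 6, 4))
```

In a real pipeline each of the three crops would then be passed through the CLIP vision encoder, as the paragraph above describes.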

On the text side, ComCLIP uses the CLIP text encoder to obtain an embedding of the full text description. It also breaks the description down into subject, object, and action components so that each can be compared with its corresponding sub-image. Because the method is described as training-free, no parameters are fitted for this step.

Finally, ComCLIP computes the similarity between the compositional text embedding and the sub-image embeddings, and aggregates these scores to obtain the overall image-text matching score. This allows the model to dynamically evaluate the importance of each visual component in relation to the text, rather than relying on the fixed correlations learned by the original CLIP model.
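One plausible way to realize this aggregation is sketched below. It is an assumption-laden illustration, not a confirmed description of ComCLIP: the embeddings are toy 2-D vectors instead of CLIP outputs, and the specific choices (softmax weights over per-role cosine similarities, then adding the weighted sub-image embeddings to the full-image embedding) are assumptions made for this example. The function name `compositional_score` is hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compositional_score(
    image_emb: Vector,
    text_emb: Vector,
    sub_image_embs: Dict[str, Vector],
    component_embs: Dict[str, Vector],
) -> float:
    """Weight each sub-image by its agreement with its text component, then match."""
    roles = sorted(sub_image_embs)
    sims = [cosine(sub_image_embs[r], component_embs[r]) for r in roles]
    weights = softmax(sims)  # sub-images that agree with the text get more weight
    fused = list(image_emb)
    for weight, role in zip(weights, roles):
        fused = [f + weight * s for f, s in zip(fused, sub_image_embs[role])]
    return cosine(fused, text_emb)


# Toy embeddings (hypothetical values, not real CLIP outputs).
image_emb, text_emb = [1.0, 0.0], [1.0, 1.0]
sub_image_embs = {"subject": [0.0, 1.0], "object": [1.0, 0.0]}
component_embs = {"subject": [0.0, 1.0], "object": [0.0, 1.0]}

baseline = cosine(image_emb, text_emb)
score = compositional_score(image_emb, text_emb, sub_image_embs, component_embs)
```

Because the weights are recomputed for every image-text pair, the contribution of each component changes with the input, which is the "dynamic" evaluation the paragraph above refers to.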

The key technical novelty is this "training-free" compositional adaptation of the pretrained CLIP model, which enables improved zero-shot performance on challenging image-text matching tasks that require deeper semantic understanding, such as the SVO, ComVG, Winoground, and VL-checklist datasets.
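To make the zero-shot matching setting concrete, the sketch below shows one simple way such a model could be scored: given a caption, one matching image, and one mismatching image, the model counts as correct when it scores the matching image higher. The example format, the `pairwise_accuracy` helper, and the toy scores are hypothetical and are not taken from the paper or from any benchmark's official evaluation code.

```python
from typing import Callable, Dict, List, Tuple

# Each example: (caption, id of the matching image, id of the mismatching image).
Example = Tuple[str, str, str]


def pairwise_accuracy(examples: List[Example], score: Callable[[str, str], float]) -> float:
    """Fraction of examples where the matching image outscores the mismatching one."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        1 for caption, pos, neg in examples if score(caption, pos) > score(caption, neg)
    )
    return correct / len(examples)


# Hypothetical scores standing in for a real model's image-text similarities.
toy_scores: Dict[Tuple[str, str], float] = {
    ("a dog chases a cat", "img_a"): 0.31, ("a dog chases a cat", "img_b"): 0.27,
    ("a cat chases a dog", "img_b"): 0.25, ("a cat chases a dog", "img_a"): 0.29,
    ("a man rides a horse", "img_c"): 0.33, ("a man rides a horse", "img_d"): 0.20,
    ("a girl holds a cup", "img_e"): 0.30, ("a girl holds a cup", "img_f"): 0.22,
}

examples: List[Example] = [
    ("a dog chases a cat", "img_a", "img_b"),
    ("a cat chases a dog", "img_b", "img_a"),
    ("a man rides a horse", "img_c", "img_d"),
    ("a girl holds a cup", "img_e", "img_f"),
]

accuracy = pairwise_accuracy(examples, lambda caption, image: toy_scores[(caption, image)])
```

The second toy example, where swapping subject and object leaves the wrong image ahead, is the kind of failure that compositional benchmarks are designed to expose.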

Critical Analysis

The ComCLIP approach is a clever and principled way to address the limitations of existing vision-language models like CLIP when it comes to compositional reasoning. By explicitly modeling the different visual components and their relationship to the text, the model can better capture the underlying semantics and avoid being misled by spurious correlations.

However, the reliance on external object detection and action recognition models introduces a dependency: if these models misidentify or miss a component, the resulting sub-images will be wrong and ComCLIP's matching score can suffer. The step that decomposes the text into subject, object, and action components is another potential source of error.

It would be interesting to see how ComCLIP performs compared to more sophisticated compositional vision-language models, such as those that use structured representations or neural module networks. Further research could also explore ways to integrate the sub-image disentanglement and composition directly into the end-to-end training process, rather than as a separate post-processing step.

Overall, the ComCLIP approach is a valuable contribution to the field of vision-language understanding, demonstrating the importance of considering compositional and causal reasoning when designing these models.


The ComCLIP paper presents a novel approach to improving the compositional generalization of vision-language models like CLIP. By disentangling input images into subjects, objects, and actions, and then composing the CLIP vision and text encoders to perform matching over the sub-image and text embeddings, ComCLIP is able to mitigate the spurious correlations learned by the original CLIP model and better capture the underlying semantics.

Experiments show that this "training-free" compositional adaptation of CLIP can boost its zero-shot performance on a range of challenging image-text matching tasks, including SVO, ComVG, Winoground, and VL-checklist. This work highlights the importance of compositional and causal reasoning in vision-language models, and paves the way for further advancements in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models

Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi





Recent text-to-image diffusion-based generative models have the stunning ability to generate highly detailed and photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate this compositionality-based failure mode and highlight that imperfect text conditioning with CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes which shows that the output space of the CLIP text-encoder is sub-optimal, and (ii) we observe that the final token embeddings in CLIP are erroneous as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding shows that the best compositional improvements can be achieved (without harming the model's FID scores) by fine-tuning only a simple linear projection on CLIP's representation space in Stable-Diffusion variants using a small set of compositional image-text pairs. This result demonstrates that the sub-optimality of CLIP's output space is a major error source. We also show that re-weighting the erroneous attention contributions in CLIP can lead to improved compositional performance; however, these improvements are often less significant than those achieved by solely learning a linear projection head, highlighting erroneous attentions to be only a minor error source.


RankCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun





Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.


Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang, Rabiul Awal, Aishwarya Agrawal





Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in bag-of-words representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at


Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Gunther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao





Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.

