CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts






Published 4/30/2024 by Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi



Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable ability of the learned features for generalization. However, the features they learn often blend content and style information, which somewhat limits their generalization capabilities under distribution shifts. To address this limitation, we adopt a causal generative perspective for multimodal data and propose contrastive learning with data augmentation to disentangle content features from the original representations. To achieve this, we begin by exploring image augmentation techniques and develop a method to seamlessly integrate them into pre-trained CLIP-like models to extract pure content features. Taking a step further, recognizing the inherent semantic richness and logical structure of text data, we explore the use of text augmentation to isolate latent content from style features. This enables CLIP-like models' encoders to concentrate on latent content information, refining the representations learned by pre-trained CLIP-like models. Our extensive experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks, alongside enhanced robustness to various perturbations. These results underscore the effectiveness of our proposed methods in refining vision-language representations and advancing the state-of-the-art in multimodal learning.


Overview

  • Contrastive vision-language models like CLIP have shown impressive generalization abilities, but their learned features often blend content and style information
  • To address this limitation, the researchers adopt a causal generative perspective and propose contrastive learning with data augmentation to disentangle content features from the original representations
  • They explore image and text augmentation techniques to extract pure content features and refine the representations learned by CLIP-like models
  • Extensive experiments demonstrate significant improvements in zero-shot and few-shot classification tasks, as well as enhanced robustness to perturbations

Plain English Explanation

Contrastive vision-language models, like the popular CLIP model, have become very effective at various tasks by learning powerful visual and language features. However, these models often struggle to fully separate the underlying "content" information from the "style" information in the data they're trained on.

To address this, the researchers in this paper take a new approach inspired by causal modeling. They use data augmentation techniques to help the model learn features that are more focused on the core content, rather than getting distracted by the superficial style details. For images, this might involve applying transformations like cropping or color changes, while for text, it could mean paraphrasing or altering the sentence structure.

By incorporating these content-focused data augmentation methods into the training of CLIP-like models, the researchers were able to extract "purer" content features that generalized better. In their experiments, this led to significant improvements in the model's performance on zero-shot and few-shot classification tasks and made it more robust to different types of perturbations or distribution shifts.

The key innovation here is using causal reasoning and targeted data augmentation to help these powerful vision-language models learn representations that are more focused on the underlying meaning and content, rather than getting overly entangled with superficial stylistic details. This advancement could lead to even more capable and versatile contrastive models in the future.

Technical Explanation

The researchers begin by observing that the features learned by contrastive vision-language models like CLIP often blend content and style information, which limits their generalization capabilities under distribution shifts. To address this, they adopt a causal generative perspective for multimodal data and propose contrastive learning with data augmentation to disentangle content features from the original representations.
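
The contrastive objective behind this idea can be illustrated with a minimal sketch. It is not the authors' implementation: the InfoNCE-style loss, the function names (`cosine`, `info_nce`), the temperature of 0.1, and the toy 3-dimensional vectors are all assumptions chosen for illustration. Each original embedding is pulled toward the embedding of its own augmented view and pushed away from the views of other samples, so coordinates that change under augmentation (style) stop helping the match.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i]
    and mismatch every positives[j] with j != i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (assumed): the first two coordinates stand in for content,
# the third for style, which the augmentation changes.
originals = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2]]
augmented = [[0.9, 0.1, -0.3], [0.1, 0.9, -0.3]]

aligned_loss = info_nce(originals, augmented)        # views paired with their own content
swapped_loss = info_nce(originals, augmented[::-1])  # views paired with the wrong content
```

With these toy vectors, pairing each original with its own augmented view gives a lower loss than pairing it with the other sample's view, which is the signal that would push training toward content-focused features.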

First, the researchers explore image augmentation techniques and develop a method to seamlessly integrate them into pre-trained CLIP-like models. This allows the model to extract purer content features by learning to be invariant to style-related transformations. Building on this, the researchers recognize the inherent semantic richness and logical structure of text data, and explore the use of text augmentation to isolate latent content from style features. This further enables the CLIP-like model's encoders to concentrate on the core content information, refining the learned representations.
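
A text augmentation of this kind might look like the sketch below. The source does not list the actual augmentations or prompt attributes, so the style vocabulary (`STYLE_PREFIXES`, `STYLE_ADJECTIVES`) and the helper names (`augment_prompt`, `make_views`) are hypothetical. The only point illustrated is that the class name (content) stays fixed while the surrounding wording (style) is resampled, producing several views of the same content.

```python
import random

# Hypothetical style vocabulary; not taken from the paper.
STYLE_PREFIXES = ["a photo of", "a sketch of", "a painting of", "a cartoon of"]
STYLE_ADJECTIVES = ["", "small", "large", "colorful", "blurry"]


def augment_prompt(class_name, rng):
    """Build a prompt that keeps the class name and resamples the style words."""
    prefix = rng.choice(STYLE_PREFIXES)
    adjective = rng.choice(STYLE_ADJECTIVES)
    words = [prefix, "a", adjective, class_name]
    # Drop the empty adjective so no double spaces appear.
    return " ".join(w for w in words if w)


def make_views(class_names, views_per_class=2, seed=0):
    """Return several style-varied prompts per class, reproducibly."""
    rng = random.Random(seed)
    return {
        name: [augment_prompt(name, rng) for _ in range(views_per_class)]
        for name in class_names
    }


views = make_views(["dog", "guitar"])
```

Prompts generated for the same class could then serve as positive pairs in a contrastive loss, since they share content and differ only in the sampled style words.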

The researchers conduct extensive experiments across diverse datasets, evaluating their approach on zero-shot and few-shot classification tasks, as well as assessing the models' robustness to various perturbations. The results demonstrate significant improvements over the original CLIP-like models, highlighting the effectiveness of their proposed methods in refining vision-language representations and advancing the state-of-the-art in multimodal learning.
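
For readers unfamiliar with the zero-shot setup, the sketch below shows the usual CLIP-style procedure: an image is assigned the class whose text embedding has the highest cosine similarity with the image embedding. The embeddings here are made-up toy vectors and the function names (`normalize`, `zero_shot_predict`) are assumptions; in practice the vectors would come from the image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the best-matching label and the cosine score for every label."""
    image = normalize(image_embedding)
    scores = {}
    for label, text_embedding in class_text_embeddings.items():
        text = normalize(text_embedding)
        scores[label] = sum(a * b for a, b in zip(image, text))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (assumed); a real pipeline would use encoder outputs.
class_embeddings = {"dog": [1.0, 0.1, 0.0], "cat": [0.1, 1.0, 0.0]}
image_embedding = [0.8, 0.2, 0.5]

label, scores = zero_shot_predict(image_embedding, class_embeddings)
```

No classifier is trained in this setup, so the result depends entirely on how well the two encoders align content across modalities, which is the property the paper's method aims to improve.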

Critical Analysis

The researchers' approach of using causal reasoning and targeted data augmentation to disentangle content and style features in contrastive vision-language models is a compelling and well-executed idea. By focusing the models' learning on the core content information, rather than getting distracted by superficial stylistic details, they were able to achieve significant performance gains in challenging tasks like zero-shot and few-shot classification.

That said, the paper does not delve deeply into the potential limitations or caveats of the method. For example, it would be interesting to explore how the performance and robustness gains vary across different domains or dataset characteristics. Additionally, the researchers do not provide a detailed analysis of the learned content features or how they differ from the original CLIP-like representations.

Furthermore, while the text augmentation techniques seem promising, the paper lacks a thorough exploration of their effectiveness compared to the image augmentation methods. It would be valuable to understand the relative contributions of these modality-specific augmentation strategies and how they interact with the overall contrastive learning framework.

Overall, the research presented in this paper represents an important step forward in improving the generalization and robustness of contrastive vision-language models. However, further investigation into the limitations, potential biases, and broader applicability of the proposed methods could lead to even more impactful advancements in this rapidly evolving field of multimodal learning.


Conclusion

This paper introduces a novel approach to refining the representations learned by contrastive vision-language models like CLIP by leveraging causal reasoning and targeted data augmentation. By disentangling content and style features, the researchers were able to significantly improve the models' performance on zero-shot and few-shot classification tasks, as well as enhance their robustness to various perturbations.

The key innovation here is the use of image and text augmentation techniques to help the model focus on learning the core content information, rather than getting distracted by superficial stylistic details. This advancement could lead to even more capable and versatile contrastive models that can better generalize to a wider range of real-world applications.

While the paper does not fully explore the limitations and potential biases of the proposed methods, the researchers have made an important contribution to the field of multimodal learning. Their work highlights the value of incorporating causal reasoning and targeted data augmentation into the training of powerful vision-language models, paving the way for further advancements in this rapidly evolving area of artificial intelligence.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas





There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.


RankCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun





Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.


Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Le Zhang, Rabiul Awal, Aishwarya Agrawal





Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in bag-of-words representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at


RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng





Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner; it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at
