Refining Skewed Perceptions in Vision-Language Models through Visual Representations

2405.14030

Published 5/24/2024 by Haocheng Dai, Sarang Joshi

Abstract

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.

Overview

Large vision-language models (VLMs) like CLIP have shown remarkable success across many tasks, but they inherit biases from real-world datasets.
These biases can lead to misconceptions about the actual environment and diminish VLM performance when key contextual elements are absent.
This study investigates how a simple linear probe can distill task-specific core features from CLIP's embedding for downstream applications.

Plain English Explanation

Large vision-language models (VLMs) like CLIP have become very powerful and useful for a variety of applications. However, similar to other foundational AI systems, these models can inherit biases from the real-world data they are trained on. This means they may make incorrect assumptions or have misconceptions about the actual environment they are operating in.

Datasets commonly used to train VLMs, like ImageNet, often contain non-causal, spurious correlations. This means the models learn associations that don't actually reflect the true underlying relationships. When these contextual cues are missing, it can diminish the VLM's performance.

The researchers in this study looked at how a simple linear model can effectively extract the core, task-specific features from CLIP's embeddings. Their analysis revealed that CLIP's text representations are often tainted by these spurious correlations inherited from the biased pre-training data. Interestingly, they found that relying more on CLIP's visual representations, rather than the text embeddings, was a better way to overcome these embedded biases.

Technical Explanation

This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. The researchers' analysis reveals that the CLIP text representations are often tainted by spurious correlations, which are inherited from the biased pre-training dataset.
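
To make the idea of a linear probe concrete, here is a minimal sketch, assuming the embeddings have already been extracted and frozen. It is not the authors' code: the embeddings are synthetic stand-ins generated with the standard library, and the names `make_embedding`, `train_linear_probe`, and `DIM` are hypothetical. The probe is plain binary logistic regression, one common way to fit a linear classifier on top of fixed features.

```python
import math
import random

random.seed(0)  # fixed seed so the toy run is reproducible

DIM = 4  # hypothetical embedding size; real CLIP embeddings are far larger


def make_embedding(label):
    """Toy stand-in for a frozen image embedding (an assumption, not CLIP output).

    Dimension 0 carries the task-relevant ("core") signal; the rest is noise.
    """
    core = random.gauss(1.0 if label == 1 else -1.0, 0.5)
    return [core] + [random.gauss(0.0, 1.0) for _ in range(DIM - 1)]


def make_dataset(n):
    """Build n labeled toy embeddings with alternating labels."""
    labels = [i % 2 for i in range(n)]
    return [make_embedding(y) for y in labels], labels


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression with full-batch gradient descent.

    Only the probe's weights and bias are learned; the embeddings stay fixed.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


train_x, train_y = make_dataset(200)
test_x, test_y = make_dataset(200)

weights, bias = train_linear_probe(train_x, train_y)
correct = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)

print("probe weights:", [round(wi, 2) for wi in weights])
print("test accuracy:", accuracy)
```

In this toy setup the learned weight on dimension 0 dominates the others, which is the sense in which a linear probe can pick out a task-specific direction from a fixed embedding. With a real model, the synthetic vectors would be replaced by embeddings computed once from the frozen encoder.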

Empirical evidence suggests that relying on CLIP's visual representations, rather than its text embeddings, is a more practical way to refine the skewed perceptions in VLMs. This underscores the greater utility of visual representations for overcoming embedded biases.
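
The contrast between text-based and visual-based classification can be pictured with a small geometric toy. This is an illustrative sketch only, not the paper's experiment: every vector below is made up, the two axes ("core" and "context") are assumptions, and class-mean prototypes are used as a simple stand-in for a classifier built from visual representations. The text vectors are deliberately constructed to lean on the context axis, mimicking a text embedding tainted by a spurious correlation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(x, class_vectors):
    """Pick the class whose reference vector is most similar to x."""
    return max(class_vectors, key=lambda name: cosine(x, class_vectors[name]))


# Hypothetical 2-D embeddings: axis 0 = core object feature, axis 1 = context.
# Each class appears once in its "usual" context and once in the other one.
images = [
    ("A", [1.0, 1.0]),    # class A in its usual context
    ("A", [1.0, -1.0]),   # class A in the unusual context
    ("B", [-1.0, 1.0]),   # class B in the unusual context
    ("B", [-1.0, -1.0]),  # class B in its usual context
]

# Made-up text embeddings that mostly encode context (the spurious cue).
text_vectors = {"A": [0.3, 1.0], "B": [-0.3, -1.0]}

# Visual prototypes: the mean image embedding per class. They are fit on the
# same four toy points, so this shows geometry only, not generalization.
visual_prototypes = {}
for name in ("A", "B"):
    members = [x for label, x in images if label == name]
    visual_prototypes[name] = [sum(col) / len(members) for col in zip(*members)]


def accuracy(class_vectors):
    """Fraction of toy images assigned to their true class."""
    hits = sum(classify(x, class_vectors) == label for label, x in images)
    return hits / len(images)


text_acc = accuracy(text_vectors)
proto_acc = accuracy(visual_prototypes)
print("text-based accuracy:", text_acc)
print("visual-prototype accuracy:", proto_acc)
```

With these made-up vectors the text-based rule gets 2 of 4 images right, failing exactly on the two images shown in the unusual context, while the prototype rule gets all 4 right because averaging over both contexts cancels the context axis. The numbers say nothing about real CLIP behavior; they only show why a reference vector built from balanced visual examples can be less sensitive to a spurious direction.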

The researchers also provide additional context on how VLMs like CLIP are trained and the types of biases they can inherit. They note that prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent.

Critical Analysis

The researchers acknowledge several caveats and limitations in their work. They note that while their findings emphasize the superiority of visual representations over text embeddings for overcoming biases, there may be specific tasks or scenarios where text-based features are still valuable.

Additionally, the researchers suggest that further research is needed to develop more sophisticated techniques for distilling and refining task-specific representations from large VLMs. The simple linear probe used in this study may have limitations in capturing the nuanced relationships within the data.

Overall, this research provides important insights into the biases inherent in large vision-language models and highlights the potential advantages of leveraging visual representations to overcome these issues. However, continued work is needed to fully address the challenges of bias and robustness in these powerful AI systems.

Conclusion

This study investigates how a simple linear probe can effectively extract task-specific core features from CLIP's embedding for downstream applications. The researchers found that CLIP's text representations are often tainted by spurious correlations inherited from biased pre-training data, while the visual representations demonstrate superior utility in overcoming these embedded biases.

These findings contribute to our understanding of the limitations and biases present in large vision-language models, and suggest that a greater emphasis on visual processing may be a promising approach for developing more robust and reliable AI systems. As the field continues to advance, further research will be needed to build on these insights and develop more sophisticated techniques for mitigating bias in foundational AI models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Why are Visually-Grounded Language Models Bad at Image Classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset.

5/29/2024

cs.CV cs.AI cs.CL cs.LG

The Neglected Tails in Vision-Language Models

Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong

Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage and 10,000x less training time!

5/24/2024

cs.CV cs.CL cs.LG

RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner; it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP

6/12/2024

cs.CV

They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias

Salma Abdel Magid, Jui-Hsien Wang, Kushal Kafle, Hanspeter Pfister

Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.

6/18/2024

cs.CV cs.IR cs.LG