When are Lemons Purple? The Concept Association Bias of Vision-Language Models

2212.12043

Published 4/16/2024 by Yutaro Yamada, Yingtian Tang, Yoyo Zhang, Ilker Yildirim

🔍

Abstract

Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such performance does not realize in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). As a potential cause of the difficulty of applying these models to VQA and similar tasks, we report an interesting phenomenon of vision-language models, which we call the Concept Association Bias (CAB). We find that models with CAB tend to treat input as a bag of concepts and attempt to fill in the other missing concept crossmodally, leading to an unexpected zero-shot prediction. We demonstrate CAB by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. eggplant) and an attribute (e.g. color purple). We also show that the strength of CAB predicts the performance on VQA. We observe that CAB is prevalent in vision-language models trained with contrastive losses, even when autoregressive losses are jointly employed. However, a model that solely relies on autoregressive loss seems to exhibit minimal or no signs of CAB.

Create account to get full access

Overview

Large-scale vision-language models like CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval.
However, these models struggle with tasks that require finer-grained correspondence between vision and language, such as Visual Question Answering (VQA).
This paper investigates a potential cause for this challenge, a phenomenon called Concept Association Bias (CAB).

Plain English Explanation

Vision-language models like CLIP have become quite skilled at recognizing objects in images and connecting them to relevant text. For example, if you show CLIP an image of an eggplant, it can accurately identify the object and provide a textual description.

However, these models struggle with more nuanced tasks that require a deeper understanding of the relationship between visual and linguistic information. For instance, when asked a specific question about an image, such as "What color is the eggplant?", the model may not perform as well.

The researchers in this paper identified a potential reason for this challenge, which they call the Concept Association Bias (CAB). CAB occurs when the model treats the input as a collection of individual concepts and tries to fill in the missing information by relying on associated concepts, rather than truly understanding the visual-linguistic relationship.

For example, if the model strongly associates the concept of "eggplant" with the color "purple," it may automatically assume the eggplant in the image is purple, even if the actual color is different. This type of bias can lead to unexpected and incorrect predictions, especially in tasks that require more precise visual-linguistic reasoning.

The researchers demonstrate this phenomenon by showing that CLIP's zero-shot classification performance suffers when there is a strong concept association between an object and an attribute. They also find that the strength of this bias can predict how well the model will perform on VQA tasks.

Technical Explanation

The paper investigates the Concept Association Bias (CAB) exhibited by large-scale vision-language models, which may contribute to their difficulty in tasks requiring fine-grained visual-linguistic correspondence, such as Visual Question Answering (VQA).

The researchers find that models with CAB tend to treat input as a "bag of concepts" and attempt to fill in the missing concepts based on cross-modal associations, leading to unexpected zero-shot predictions. They demonstrate this bias by showing that CLIP's zero-shot classification performance significantly suffers when there is a strong concept association between an object (e.g., eggplant) and an attribute (e.g., color purple).

The authors also show that the strength of CAB can predict the performance of these models on VQA tasks. They observe that CAB is prevalent in vision-language models trained with contrastive losses, even when autoregressive losses are also used. In contrast, a model that solely relies on autoregressive loss appears to exhibit minimal or no signs of CAB.

These findings suggest that the Concept Association Bias is a potential factor contributing to the difficulty of applying large-scale vision-language models to tasks that require a deeper understanding of the visual-linguistic relationship, such as VQA and evolving interpretable visual classifiers.

Critical Analysis

The paper provides a thoughtful analysis of an interesting phenomenon observed in large-scale vision-language models, the Concept Association Bias (CAB). The researchers present a clear and well-designed experimental setup to demonstrate the impact of CAB on model performance, particularly in the context of Visual Question Answering (VQA).

One potential limitation of the study is the specific focus on the CLIP model, which may limit the generalizability of the findings to other vision-language architectures. It would be valuable to investigate whether CAB is a more widespread issue across a broader range of large language models (LLMs) and vision-language models.

Additionally, the paper does not delve into the potential causes of CAB, beyond the observation that it is more prevalent in models trained with contrastive losses. Further research could explore the underlying mechanisms that contribute to the emergence of this bias and investigate potential mitigation strategies.

Overall, this paper makes a valuable contribution to the understanding of the challenges in applying large-scale vision-language models to more complex tasks and highlights the importance of carefully studying the biases and limitations of these powerful models.

Conclusion

This paper investigates an interesting phenomenon called the Concept Association Bias (CAB) observed in large-scale vision-language models, which may explain their difficulty in performing well on tasks that require a deeper understanding of the visual-linguistic relationship, such as Visual Question Answering (VQA).

The researchers demonstrate that models with CAB tend to treat input as a "bag of concepts" and attempt to fill in the missing concepts based on cross-modal associations, leading to unexpected and incorrect predictions. They show that the strength of CAB can predict the performance of these models on VQA tasks.

These findings suggest that addressing the Concept Association Bias may be a key step in enabling vision-language models to excel in more nuanced and contextual tasks that require a deeper integration of visual and linguistic understanding. Further research is needed to explore the underlying causes of CAB and develop strategies to mitigate this bias.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Neglected Tails in Vision-Language Models

Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong

Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, yielding biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage and 10,000x less training time!

5/24/2024

cs.CV cs.CL cs.LG

Improving Concept Alignment in Vision-Language Concept Bottleneck Models

Nithish Muthuchamy Selvaraj, Xiaobao Guo, Bingquan Shen, Adams Wai-Kin Kong, Alex Kot

Concept Bottleneck Models (CBM) map the input image to a high-level human-understandable concept space and then make class predictions based on these concepts. Recent approaches automate the construction of CBM by prompting Large Language Models (LLM) to generate text concepts and then use Vision Language Models (VLM) to obtain concept scores to train a CBM. However, it is desired to build CBMs with concepts defined by human experts instead of LLM generated concepts to make them more trustworthy. In this work, we take a closer inspection on the faithfulness of VLM concept scores for such expert-defined concepts in domains like fine-grain bird species classification and animal classification. Our investigations reveal that frozen VLMs, like CLIP, struggle to correctly associate a concept to the corresponding visual input despite achieving a high classification performance. To address this, we propose a novel Contrastive Semi-Supervised (CSS) learning method which uses a few labeled concept examples to improve concept alignment (activate truthful visual concepts) in CLIP model. Extensive experiments on three benchmark datasets show that our approach substantially increases the concept accuracy and classification accuracy, yet requires only a fraction of the human-annotated concept labels. To further improve the classification performance, we also introduce a new class-level intervention procedure for fine-grain classification problems that identifies the confounding classes and intervenes their concept space to reduce errors.

5/6/2024

cs.CV

🔍

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Haocheng Dai, Sarang Joshi

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.

5/24/2024

cs.CV cs.CL

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

Do vision-language models (VLMs) pre-trained to caption an image of a durian learn visual concepts such as brown (color) and spiky (texture) at the same time? We aim to answer this question as visual concepts learned for free would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: First, certain concept prompts include shortcuts that recognize correct concepts for wrong reasons; Second, multimodal information (e.g. visual discriminativeness, and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g. spiky as opposed to spiky durian), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. All code and models are publicly released.

4/22/2024

cs.CV cs.AI cs.CL cs.LG