Why are Visually-Grounded Language Models Bad at Image Classification?

Read original: arXiv:2405.18415 - Published 5/29/2024 by Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

Overview

Visually-Grounded Language Models (VLMs) are AI systems that combine computer vision and natural language processing capabilities.
Despite often using CLIP as a vision encoder and having many more parameters, VLMs such as GPT-4V and LLaVA significantly underperform CLIP on standard image classification benchmarks like ImageNet.
This paper investigates why VLMs struggle with image classification and discusses potential solutions to this problem.

Plain English Explanation

Visually-Grounded Language Models (VLMs) are a type of AI system that can understand both visual and textual information. These models are trained on large datasets that contain images and the corresponding captions or descriptions. This allows them to learn how language is used to describe the visual world.

While VLMs excel at language-related tasks like question answering or text generation, they perform significantly worse than CLIP on standard image classification benchmarks such as ImageNet, even though many VLMs use CLIP itself as their vision encoder. This paper explores the reasons behind this surprising gap in performance.

Technical Explanation

VLMs are Bad at Image Classification

The paper explores several hypotheses for why VLMs struggle with image classification:

Inference Algorithms: The way a class prediction is obtained from the VLM at inference time could be responsible for the gap.
Training Objectives: The objective VLMs are trained with could be poorly suited to classification.
Data Processing: The data VLMs see during training and instruction-tuning could be the limiting factor.

The analysis concludes that the primary cause is data-related. The information needed for classification is encoded in the VLM's latent space, but it can only be decoded effectively with enough training data. In particular, how often a class appears during VLM training and instruction-tuning correlates strongly with the VLM's accuracy on that class, and VLMs trained with sufficient data can match the accuracy of state-of-the-art classification models.
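The paper's abstract reports a strong correlation between how often a class is seen during VLM training and the VLM's accuracy on that class. The sketch below shows one way such a relationship could be measured. It is an illustration only: the choice of Spearman rank correlation, the function names, and all of the toy numbers are assumptions, not the paper's method or data.

```python
from collections import Counter

def per_class_accuracy(labels, predictions):
    """Return {class: accuracy} from parallel lists of true and predicted labels."""
    total = Counter(labels)
    correct = Counter(t for t, p in zip(labels, predictions) if t == p)
    return {c: correct[c] / total[c] for c in total}

def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up training frequencies and predictions, for illustration only.
train_frequency = {"dog": 900, "cat": 700, "tench": 3, "axolotl": 1}
labels = ["dog", "dog", "cat", "cat", "tench", "tench", "axolotl", "axolotl"]
predictions = ["dog", "dog", "cat", "dog", "tench", "fish", "lizard", "frog"]

accuracy = per_class_accuracy(labels, predictions)
classes = sorted(accuracy)
rho = spearman(
    [train_frequency[c] for c in classes],
    [accuracy[c] for c in classes],
)
print(f"Spearman correlation between frequency and accuracy: {rho:.3f}")
```

A value of `rho` near 1 would indicate that classes seen more often in training tend to be classified more accurately, which is the pattern the abstract describes.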

Based on these findings, the authors enhance a VLM by integrating classification-focused datasets into its training. They report that the improved classification performance transfers to the model's general capabilities, yielding an 11.8% improvement on the newly collected ImageWikiQA dataset.
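One way to picture adding classification data to VLM training is to rewrite each labeled image as a question-and-answer example and mix it with the model's usual instruction data. The sketch below is a minimal illustration under stated assumptions: the `InstructionExample` record, the prompt wording, the file paths, and the mixing strategy are all hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class InstructionExample:
    image: str     # path to the image file
    question: str  # text prompt shown to the VLM
    answer: str    # target text the VLM is trained to generate

# Hypothetical prompt wording; the paper's actual prompt may differ.
QUESTION = "What type of object is in this photo?"

def classification_to_instructions(samples, question=QUESTION):
    """Turn (image_path, class_name) pairs into instruction-style examples."""
    return [
        InstructionExample(image=path, question=question, answer=label)
        for path, label in samples
    ]

def mix_datasets(general, classification, seed=0):
    """Concatenate both sources and shuffle them with a fixed seed."""
    mixed = list(general) + list(classification)
    random.Random(seed).shuffle(mixed)
    return mixed

# Made-up data, for illustration only.
general_data = [
    InstructionExample("img/street.jpg", "Describe the scene.", "A busy street at dusk."),
]
labeled_images = [("img/0001.jpg", "tench"), ("img/0002.jpg", "goldfish")]

class_data = classification_to_instructions(labeled_images)
training_set = mix_datasets(general_data, class_data)
print(len(training_set), "training examples")
```

Framing labels as generated text keeps the classification data in the same format as the rest of the instruction data, so no separate classification head is needed in this sketch.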

Critical Analysis

The paper provides a thoughtful analysis of the challenges VLMs face in image classification and offers potential avenues for improvement. However, it is important to note that the research is still in its early stages, and further experimentation and validation will be necessary to fully understand the root causes of this performance gap.

Additionally, the paper does not delve into the potential implications of this finding for the broader development of multimodal AI systems. As VLMs become more prevalent, it will be crucial to address their shortcomings in order to ensure they can be safely and effectively deployed in real-world applications.

Conclusion

This paper sheds light on a counterintuitive finding: Visually-Grounded Language Models such as GPT-4V and LLaVA underperform CLIP on image classification, a core computer vision task, even when they use CLIP as their vision encoder. The authors trace the gap primarily to training data and show that, with sufficient classification data, VLMs can match state-of-the-art classification models. They hope these findings will inform the development of more robust and versatile multimodal AI systems that integrate visual and linguistic understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Why are Visually-Grounded Language Models Bad at Image Classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset.

5/29/2024

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Haocheng Dai, Sarang Joshi

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.

5/24/2024

Improving the Efficiency of Visually Augmented Language Models

Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

Despite the impressive performance of autoregressive Language Models (LM) it has been shown that due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that scaling up our model within the compute budget of VALM, either increasing the model or pre-training corpus size, we outperform VALM for all the evaluation tasks.

9/18/2024

Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation

Jheng-Hong Yang, Jimmy Lin

Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's $\tau \sim 0.4$ when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's $\kappa$ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.

8/6/2024