What is the Visual Cognition Gap between Humans and Multimodal LLMs?

2406.10424

Published 6/18/2024 by Xu Cao, Bolin Lai, Wenqian Ye, Yunsheng Ma, Joerg Heintz, Jintai Chen, Jianguo Cao, James M. Rehg

cs.CV cs.AI

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Abstract

Recently, Multimodal Large Language Models (MLLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level reasoning is not well-established. One such challenge is abstract visual reasoning (AVR) -- the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children. Inspired by the AVR tasks in Raven's Progressive Matrices (RPM) and Wechsler Intelligence Scale for Children (WISC), we propose a new dataset MaRs-VQA and a new benchmark VCog-Bench containing three datasets to evaluate the zero-shot AVR capability of MLLMs and compare their performance with existing human intelligent investigation. Our comparative experiments with different open-source and closed-source MLLMs on the VCog-Bench revealed a gap between MLLMs and human intelligence, highlighting the visual cognitive limitations of current MLLMs. We believe that the public release of VCog-Bench, consisting of MaRs-VQA, and the inference pipeline will drive progress toward the next generation of MLLMs with human-like visual cognition abilities.

Create account to get full access

Overview

• This paper explores the visual cognition gap between humans and multimodal large language models (LLMs), which are AI systems trained on both text and visual data.

• The researchers investigate how well these multimodal LLMs can understand and reason about visual information, and how their abilities compare to human visual cognition.

• They develop a Cognitive Evaluation Benchmark for Image Reasoning and Description to assess and compare the visual understanding capabilities of humans and LLMs.

Plain English Explanation

Humans are incredibly skilled at understanding and reasoning about visual information. We can quickly recognize objects, understand complex scenes, and draw conclusions about what we see. However, it's not clear how well AI systems that are trained on both text and images (called "multimodal" AI) can match human visual cognition.

This paper investigates the "visual cognition gap" - the difference between how well humans and multimodal AI systems can process and make sense of visual information. The researchers developed a specialized benchmark test to evaluate and compare the visual understanding abilities of humans and these advanced AI models.

By using this benchmark, they were able to shed light on the strengths and limitations of multimodal AI systems when it comes to visual cognition. This helps us understand where these AI models excel, where they fall short, and how much further they need to go to truly match human-level visual understanding.

Technical Explanation

The paper introduces a Cognitive Evaluation Benchmark for Image Reasoning and Description, which consists of a series of tasks designed to assess different aspects of visual cognition. These include image classification, visual question answering, and image-to-text description generation.

The researchers evaluated several state-of-the-art multimodal LLMs, including MARVEL, BLINK, and ViCoR, on this benchmark and compared their performance to that of human participants.

The results showed that while the multimodal LLMs performed well on some tasks, they still lagged behind humans in terms of overall visual cognition. The models struggled with tasks that required deeper understanding of visual concepts, commonsense reasoning, and abstraction. The paper also identified specific areas where the models exhibited visual shortcomings.

Critical Analysis

The paper provides a comprehensive and well-designed evaluation of the visual cognition capabilities of multimodal LLMs. The benchmark tasks cover a range of visual understanding skills, allowing for a nuanced assessment of the models' strengths and weaknesses.

One potential limitation of the study is that it only evaluates a small number of multimodal LLMs, and the field of AI is rapidly advancing. It would be valuable to see the benchmark expanded to include a wider range of models, including newer and more advanced systems, to get a more complete picture of the current state of visual cognition in AI.

Additionally, the paper does not delve deeply into the specific architectural and training approaches that contribute to the observed performance gaps. Further research exploring the relationship between model design, training data, and visual cognition would be helpful in guiding the development of more capable multimodal AI systems.

Conclusion

This paper highlights the significant gap that still exists between human and machine visual cognition. While multimodal LLMs have made impressive strides in understanding and processing visual information, they fall short of matching the breadth and depth of human visual understanding.

The authors' Cognitive Evaluation Benchmark for Image Reasoning and Description provides a valuable tool for assessing and tracking the progress of multimodal AI systems in this domain. As the field of AI continues to advance, closing this visual cognition gap will be an important challenge to overcome in the pursuit of truly intelligent and capable artificial systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

Xiujie Song, Mengyue Wu, Kenny Q. Zhu, Chunhao Zhang, Yanyi Chen

Large Vision-Language Models (LVLMs), despite their recent success, are hardly comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the Cookie Theft task in human cognition test, we propose a novel evaluation benchmark to evaluate high-level cognitive ability of LVLMs using images with rich semantics. It defines eight reasoning capabilities and consists of an image description task and a visual question answering task. Our evaluation on well-known LVLMs shows that there is still a large gap in cognitive ability between LVLMs and humans.

6/17/2024

cs.AI cs.CL cs.CV

MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Yifan Jiang, Jiarui Zhang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara

While multi-modal large language models (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only considered a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even count the panels in the puzzle ( <45%), hindering their ability for abstract reasoning. We release our entire code and dataset.

4/26/2024

cs.CV cs.LG

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie

Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.

4/26/2024

cs.CV

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales

The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.

6/19/2024

cs.CV cs.AI cs.CL