NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Read original: arXiv:2407.10380 - Published 7/16/2024 by Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth

Introduction

NTSEBench is a new benchmark for evaluating the cognitive reasoning abilities of vision-language models. According to the paper's abstract, it comprises 2,728 multiple-choice questions with 4,642 images across 26 categories, drawn from the NTSE examination conducted nationwide in India. The benchmark assesses how well a model understands text and images, and how well it combines the two to solve complex reasoning tasks.

NTSEBench

Overview

NTSEBench consists of a diverse set of tasks that test a model's reasoning skills across various cognitive domains, including logical, analogical, causal, and commonsense reasoning.

The benchmark aims to bridge the "visual cognition gap" between humans and current vision-language models, which struggle with complex reasoning tasks that require integrating visual and linguistic information.

Plain English Explanation

NTSEBench is a new tool for testing the reasoning abilities of AI models that can understand both images and text. It presents the models with a variety of challenging tasks that require them to combine their knowledge of language and visual information to solve complex problems. This is important because current AI models often struggle with tasks that come easily to humans, like understanding causal relationships or making logical inferences. By evaluating models on a diverse set of reasoning skills, NTSEBench can help identify the strengths and weaknesses of these systems and guide future research to develop more human-like cognitive abilities.

Technical Explanation

NTSEBench consists of several task categories that assess different aspects of a model's reasoning skills. The logical reasoning tasks evaluate a model's ability to draw valid conclusions from given premises, while the analogical reasoning tasks test its capacity to identify relevant similarities and differences between concepts. The causal reasoning tasks challenge the model to understand the underlying mechanisms driving observed events, and the commonsense reasoning tasks assess its grasp of everyday knowledge and intuitions.

The benchmark is designed to be a comprehensive evaluation of a model's cognitive capabilities, going beyond traditional benchmarks that focus on narrow skills or specific applications. By requiring models to integrate visual and linguistic information to solve complex problems, NTSEBench aims to bridge the "visual cognition gap" between human and machine intelligence.
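To make the evaluation idea concrete, here is a minimal sketch of how per-category accuracy on a multiple-choice benchmark could be computed. It is not the authors' evaluation code: the `Prediction` class, its field names, and the category labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Prediction:
    """One scored multiple-choice item (all field names are hypothetical)."""
    category: str   # illustrative label, e.g. "analogy" or "series"
    predicted: str  # option letter chosen by the model, e.g. "B"
    answer: str     # gold option letter from the answer key


def accuracy_by_category(predictions):
    """Return {category: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in predictions:
        total[p.category] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if p.predicted.strip().upper() == p.answer.strip().upper():
            correct[p.category] += 1
    return {c: correct[c] / total[c] for c in total}


if __name__ == "__main__":
    # Made-up predictions, not results from the paper.
    sample = [
        Prediction("analogy", "A", "A"),
        Prediction("analogy", "b", "C"),
        Prediction("series", "d", "D"),
    ]
    print(accuracy_by_category(sample))
```

Reporting accuracy per category rather than as a single number is one simple way to see which kinds of reasoning a model handles well and which it does not.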

Critical Analysis

The authors acknowledge that NTSEBench is not without its limitations. The benchmark tasks may not fully capture the nuances of human reasoning, and the evaluation metrics may not always align with real-world performance. Additionally, the dataset used to construct the benchmark may contain biases or other artifacts that could influence a model's performance.

Furthermore, the paper does not address the potential ethical implications of developing more advanced cognitive reasoning capabilities in AI systems. As these models become more capable of human-like reasoning, there may be concerns about their potential misuse or unintended consequences.

Conclusion

NTSEBench represents an important step forward in the evaluation of vision-language models, providing a comprehensive assessment of their cognitive reasoning abilities. By challenging models to integrate visual and linguistic information to solve complex problems, the benchmark can help identify the strengths and weaknesses of these systems and guide future research towards more human-like intelligence. However, the limitations and potential ethical considerations of this technology should be carefully considered as the field continues to evolve.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth

Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises 2,728 multiple-choice questions with a total of 4,642 images across 26 categories sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle different modalities (text and images) in the dataset instances.

7/16/2024

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

Xiujie Song, Mengyue Wu, Kenny Q. Zhu, Chunhao Zhang, Yanyi Chen

Large Vision-Language Models (LVLMs), despite their recent success, have rarely been comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the Cookie Theft task in human cognition tests, we propose a novel evaluation benchmark to evaluate the high-level cognitive ability of LVLMs using images with rich semantics. It defines eight reasoning capabilities and consists of an image description task and a visual question answering task. Our evaluation on well-known LVLMs shows that there is still a large gap in cognitive ability between LVLMs and humans.

6/17/2024

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Yijia Xiao, Edward Sun, Tianyu Liu, Wei Wang

We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and data are available at https://github.com/Yijia-Xiao/LogicVista.

7/9/2024

Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis

Aishik Nagar, Shantanu Jaiswal, Cheston Tan

Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering (VQA) benchmarks, alluding to their capabilities as visual reasoning engines. However, the benchmarks being used conflate pure visual reasoning with world knowledge, and also have questions that involve a limited number of reasoning steps. Thus, it remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge, or due to actual visual reasoning capabilities. To clarify this ambiguity, we systematically benchmark and dissect the zero-shot visual reasoning capabilities of VLMs through synthetic datasets that require minimal world knowledge, and allow for analysis over a broad range of reasoning steps. We focus on two novel aspects of zero-shot visual reasoning: i) evaluating the impact of conveying scene information as either visual embeddings or purely textual scene descriptions to the underlying large language model (LLM) of the VLM, and ii) comparing the effectiveness of chain-of-thought prompting to standard prompting for zero-shot visual reasoning. We find that the underlying LLMs, when provided textual scene descriptions, consistently perform better compared to being provided visual embeddings. In particular, 18% higher accuracy is achieved on the PTR dataset. We also find that CoT prompting performs marginally better than standard prompting only for the comparatively large GPT-3.5-Turbo (175B) model, and does worse for smaller-scale models. This suggests the emergence of CoT abilities for visual reasoning in LLMs at larger scales even when world knowledge is limited. Overall, we find limitations in the abilities of VLMs and LLMs for more complex visual reasoning, and highlight the important role that LLMs can play in visual reasoning.

9/4/2024