IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Read original: arXiv:2403.15952 - Published 8/12/2024 by Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

Overview

The paper introduces IllusionVQA, a challenging dataset of optical illusions for evaluating vision-language models.
The dataset contains a diverse set of visual illusions with corresponding questions that test a model's ability to reason about perceptual ambiguities.
The paper presents an in-depth analysis of the dataset and evaluates the performance of state-of-the-art vision-language models on the task.

Plain English Explanation

The researchers created a new dataset called IllusionVQA that contains optical illusions accompanied by questions. Optical illusions are images that trick our eyes and minds into perceiving something different from reality. The researchers wanted to use these types of images to test how well vision-language models (AI systems that can understand both images and text) can reason about perceptual ambiguities.

The IllusionVQA dataset includes a wide variety of optical illusions, such as geometric, motion, and brightness illusions. For each image, there are several questions that probe the model's understanding of the illusion. For example, a question might ask "Which shape appears larger?" for an image showing the Ebbinghaus illusion, where two circles of the same size appear different in size due to the surrounding circles.

By evaluating how well current vision-language models perform on this challenging dataset, the researchers aim to shed light on the limitations of these models when it comes to reasoning about perceptual ambiguities. This can help guide future research and development of more robust and capable AI systems.

Technical Explanation

The IllusionVQA dataset is a diverse collection of challenging optical illusions and hard-to-interpret scenes sourced from various online resources. It is organized into two distinct multiple-choice VQA tasks: comprehension and soft localization. The questions probe different aspects of the illusions, such as the perceived versus actual sizes of objects and the relative brightness of regions.
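
To make the multiple-choice format concrete, here is a minimal sketch of how one question could be represented and rendered as a text prompt. The `IllusionItem` schema, the `format_prompt` helper, the file name, and the example question are all hypothetical illustrations, not the dataset's actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IllusionItem:
    """One multiple-choice question about an image (hypothetical schema)."""
    image_path: str      # placeholder path to the illusion image
    question: str
    options: List[str]   # candidate answers shown to the model
    answer_index: int    # index into `options` of the correct answer


def format_prompt(item: IllusionItem) -> str:
    """Render the question and its lettered options as plain text."""
    lines = [f"Question: {item.question}"]
    for i, option in enumerate(item.options):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


# Invented example based on the Ebbinghaus illusion.
example = IllusionItem(
    image_path="ebbinghaus.png",
    question="Which central circle is larger?",
    options=["The left one", "The right one", "They are the same size"],
    answer_index=2,
)
print(format_prompt(example))
```

The image itself would be passed to the model alongside this text; how that is done depends on the model's interface and is left out here.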

The researchers evaluated several state-of-the-art vision-language models, including GPT4V and Gemini-Pro, on both tasks. The evaluation includes prompting strategies such as 4-shot In-Context Learning (ICL) and Chain-of-Thought reasoning.

The results show that these models struggle. GPT4V, the best-performing model, reaches 62.99% accuracy on the comprehension task (4-shot) and 49.7% on the localization task (4-shot with Chain-of-Thought), while human evaluators achieve 91.03% and 100%, respectively. ICL and Chain-of-Thought reasoning substantially degrade Gemini-Pro's performance on the localization task, and the models fail to locate optical illusions even when the correct answer appears in the context window as a few-shot example.
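
Accuracy on a multiple-choice benchmark is the share of questions answered with the correct option. A minimal sketch of that scoring step follows; the `accuracy` function name and the toy data are assumptions made for illustration and are not results from the paper.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Percentage of questions whose predicted option index matches the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Toy data, invented for illustration only.
gold = [2, 0, 1, 3]
model = [2, 1, 1, 0]
print(f"{accuracy(model, gold):.1f}%")  # 2 of 4 correct -> 50.0%
```

Scoring model and human answers with the same function is what makes figures such as a model accuracy and a human accuracy directly comparable.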

Critical Analysis

The IllusionVQA dataset presents a valuable and novel benchmark for evaluating the capabilities of vision-language models. Optical illusions pose a unique challenge for these models, as they require higher-level reasoning about the discrepancy between perceived and actual properties of the visual world.

The current state-of-the-art models fall well short of human performance on the dataset, which highlights the need for further research in this area. Better performance may require advances in areas such as causal reasoning, multimodal reasoning, and cultural understanding.

One potential limitation of the dataset is its relatively small size, which may limit the ability of large models to learn the necessary generalizations. Additionally, the dataset may not capture the full diversity of optical illusions, and further expansion could be valuable.

Conclusion

The IllusionVQA dataset represents an important step towards evaluating the robustness and generalization capabilities of vision-language models. By focusing on optical illusions, the dataset challenges these models to move beyond simple perceptual tasks and engage in more nuanced reasoning about the visual world.

The poor performance of current state-of-the-art models on this benchmark highlights the need for continued advancements in areas like causal reasoning, multimodal reasoning, and cultural understanding to develop more capable and reliable AI systems. The IllusionVQA dataset provides a valuable benchmark for driving progress in this direction and furthering the development of robust, human-like machine perception and reasoning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro in the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.

8/12/2024

Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they are surprisingly still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Claude 3.5 Sonnet performs the best at 74.94% accuracy, but this is still far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and recognizing geometric primitives that overlap or are close together. Code and data are available at: https://vlmsareblind.github.io

7/29/2024

Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models

Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe

Large language models (LLMs) have increased interest in vision language models (VLMs), which process image-text pairs as input. Studies investigating the visual understanding ability of VLMs have been proposed, but such studies are still preliminary because existing datasets do not permit a comprehensive evaluation of the fine-grained visual linguistic abilities of VLMs across multiple languages. To further explore the strengths of VLMs, such as GPT-4V, we developed new datasets for the systematic and qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages: English, Japanese, Swahili, and Urdu through utilizing templates containing questions and prompting GPT4-V to generate the answers and the rationales, 2) introduced a new VL task named unrelatedness, 3) introduced rationales to enable human understanding of the VLM reasoning process, and 4) employed human evaluation to measure the suitability of proposed datasets for VL tasks. We show that VLMs can be fine-tuned on our datasets. Our work is the first to conduct such analyses in Swahili and Urdu. Also, it introduces rationales in VL analysis, which played a vital role in the evaluation.

6/26/2024

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, Golnaz Ghiasi

Hallucination has been a major problem for large language models and remains a critical challenge when it comes to multimodality in which vision-language models (VLMs) have to deal with not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, apart from real ones, to enable dataset creation at scale. With over 7.7K examples spanning across a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results discover that benchmarking with generated images is highly correlated (r=0.97) with real images. Last but not least, we propose a novel Auto-Eval mechanism that is highly correlated with human raters (r=0.99) for evaluating VLMs. In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.

7/23/2024