KNVQA: A Benchmark for evaluation knowledge-based VQA

2311.12639

Published 6/14/2024 by Sirui Cheng, Siyu Zhang, Jiayi Wu, Muchen Lan

Abstract

Within the multimodal field, large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. However, LVLMs are still plagued by the two critical issues of object hallucination and factual accuracy, which limit the practicality of LVLMs in different scenarios. Furthermore, previous evaluation methods focus more on the comprehension and reasoning of language content but lack a comprehensive evaluation of multimodal interactions, thereby resulting in potential limitations. To this end, we propose a novel KNVQA-Eval, which is devoted to knowledge-based VQA task evaluation to reflect the factuality of multimodal LVLMs. To ensure the robustness and scalability of the evaluation, we develop a new KNVQA dataset by incorporating human judgment and perception, aiming to evaluate the accuracy of standard answers relative to AI-generated answers in knowledge-based VQA. This work not only comprehensively evaluates the contextual information of LVLMs using reliable human annotations, but also further analyzes the fine-grained capabilities of current methods to reveal potential avenues for subsequent optimization of LVLM-based estimators. Our proposed KNVQA-Eval and corresponding dataset KNVQA will facilitate the development of automatic evaluation tools with the advantages of low cost, privacy protection, and reproducibility. Our code will be released upon publication.

Overview

This paper explores the critical issues of object hallucination and factual accuracy that still plague large vision-language models (LVLMs) in the multimodal field.
The authors propose a novel evaluation framework called KNVQA-Eval to assess the factuality of LVLMs using a knowledge-based visual question answering (VQA) task.
They also introduce a new KNVQA dataset with human-annotated answers to enable a comprehensive evaluation of multimodal interactions.

Plain English Explanation

Large vision-language models (LVLMs) are powerful AI systems that can understand and process both visual and textual information. These models have made significant progress in the multimodal field, which involves combining visual and language data.

However, LVLMs still struggle with two key issues: object hallucination and factual accuracy. Object hallucination is when the model describes objects that are not actually present in the image. Poor factual accuracy is when the model provides information that is incorrect or not supported by the facts.

These limitations prevent LVLMs from being used reliably in real-world applications. Previous evaluation methods have focused more on testing the models' language understanding and reasoning, but they don't thoroughly assess how well the models can handle multimodal interactions.

To address this gap, the researchers propose a new evaluation framework called KNVQA-Eval. This framework uses a knowledge-based visual question answering (VQA) task to test the factuality of LVLMs. They also develop a new KNVQA dataset with human-annotated answers to ensure the evaluation is reliable and comprehensive.

By using this new evaluation approach, the researchers can better understand the current capabilities and limitations of LVLMs. This knowledge can then guide efforts to improve these models and make them more practical for real-world applications.

Technical Explanation

The researchers first identify the two critical issues plaguing large vision-language models (LVLMs): object hallucination and factual accuracy. Object hallucination is when the model's output refers to objects that are not present in the image, while factual accuracy refers to the model's ability to provide truthful information based on the visual and textual inputs.
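To make the notion of object hallucination concrete, here is a minimal sketch that compares the objects named in a model's answer against the objects annotated for the image. The function names and the example data are hypothetical, and this set-difference measure is only an illustration, not the scoring used in the paper.

```python
def hallucinated_objects(mentioned, present):
    """Return objects the answer mentions that are not annotated in the image."""
    present_set = {obj.lower() for obj in present}
    return sorted({obj.lower() for obj in mentioned} - present_set)


def hallucination_rate(mentioned, present):
    """Fraction of distinct mentioned objects that are absent from the image."""
    distinct = {obj.lower() for obj in mentioned}
    if not distinct:
        return 0.0  # nothing mentioned, so nothing hallucinated
    return len(hallucinated_objects(mentioned, present)) / len(distinct)


# Hypothetical example: the annotation lists a dog, a frisbee and grass,
# but the model's answer also mentions a cat.
mentioned = ["dog", "frisbee", "cat"]
present = ["Dog", "Frisbee", "grass"]
print(hallucinated_objects(mentioned, present))
print(hallucination_rate(mentioned, present))
```

In practice, extracting the mentioned objects from free-form text is itself a hard step; the sketch assumes that step has already been done.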

To address these limitations, the authors propose a novel KNVQA-Eval framework for evaluating the factuality of LVLMs. This framework focuses on a knowledge-based visual question answering (VQA) task, where the model must answer questions about the factual content of an image.
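As an illustration of what a knowledge-based VQA sample might look like and why scoring it is not trivial, the sketch below defines a hypothetical record type and two common string-matching baselines. The field names and the matching rules are assumptions made for this example; they are not taken from KNVQA-Eval.

```python
import re
import string
from dataclasses import dataclass


@dataclass
class VQASample:
    # Hypothetical fields; the real dataset schema may differ.
    image_id: str
    question: str
    standard_answer: str
    model_answer: str


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(sample):
    """Strict baseline: the normalized answers must be identical."""
    return normalize(sample.standard_answer) == normalize(sample.model_answer)


def contains_answer(sample):
    """Looser baseline: the standard answer appears inside the model answer."""
    return normalize(sample.standard_answer) in normalize(sample.model_answer)


sample = VQASample(
    image_id="img_001",
    question="What is the name of this landmark?",
    standard_answer="Eiffel Tower",
    model_answer="It is the Eiffel Tower in Paris.",
)
print(exact_match(sample), contains_answer(sample))
```

A verbose but correct answer fails the strict match and passes the loose one, while the loose match can also accept wrong answers that merely contain the right string. Gaps like this are one reason to calibrate automatic scoring against human judgment.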

To support this evaluation, the researchers develop a new KNVQA dataset that incorporates human judgments and perceptions. The dataset is designed to assess the accuracy of AI-generated answers compared to human-provided standard answers in the knowledge-based VQA task.
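One way to use human annotations like these is to check how often an automatic evaluator agrees with the human verdict on each answer. The following is a minimal sketch under that assumption; the boolean label format and the function names are hypothetical rather than the paper's protocol.

```python
from collections import Counter


def agreement_with_humans(pairs):
    """pairs: list of (human_says_correct, evaluator_says_correct) booleans."""
    if not pairs:
        raise ValueError("need at least one labeled pair")
    matches = sum(1 for human, auto in pairs if human == auto)
    return matches / len(pairs)


def confusion_counts(pairs):
    """Break disagreements down by direction."""
    counts = Counter()
    for human, auto in pairs:
        if human and auto:
            counts["true_accept"] += 1
        elif human and not auto:
            counts["false_reject"] += 1
        elif not human and auto:
            counts["false_accept"] += 1
        else:
            counts["true_reject"] += 1
    return counts


# Hypothetical labels for five model answers.
pairs = [(True, True), (True, False), (False, False), (False, True), (True, True)]
print(agreement_with_humans(pairs))
print(dict(confusion_counts(pairs)))
```

Separating false accepts from false rejects shows whether an evaluator tends to be too lenient or too strict, which a single agreement number hides.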

This comprehensive evaluation approach not only assesses the contextual information of LVLMs using reliable human annotations, but also analyzes the fine-grained capabilities of current methods. This can reveal potential avenues for subsequent optimization of LVLM-based estimators.
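A fine-grained analysis of this kind can be pictured as a per-category breakdown of correctness. The sketch below assumes each judged answer carries a category tag; the category names are invented for illustration and are not the paper's taxonomy.

```python
from collections import defaultdict


def accuracy_by_category(rows):
    """rows: iterable of (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in rows:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Hypothetical judged answers grouped by knowledge type.
rows = [
    ("landmark", True),
    ("landmark", False),
    ("species", True),
    ("species", True),
    ("brand", False),
]
for category, acc in sorted(accuracy_by_category(rows).items()):
    print(f"{category}: {acc:.2f}")
```

A breakdown like this makes it easier to see where a model is weak, even when its overall score looks acceptable.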

The authors emphasize that the proposed KNVQA-Eval framework and corresponding KNVQA dataset will facilitate the development of automatic evaluation tools with advantages such as low cost, privacy protection, and reproducibility.

Critical Analysis

The researchers have identified a crucial area for improvement in large vision-language models (LVLMs): their factual accuracy and ability to avoid object hallucination. These limitations currently prevent LVLMs from being used reliably in many real-world applications.

The proposed KNVQA-Eval framework and KNVQA dataset address this gap by providing a comprehensive evaluation of multimodal interactions. This approach could shed light on the specific strengths and weaknesses of current LVLM methods, guiding future optimization efforts.

However, the evaluation is limited to knowledge-based VQA tasks. While this is an important aspect of LVLM performance, other multimodal tasks, such as image captioning or visual question answering that does not depend on external knowledge, may reveal different challenges or capabilities of these models.

Additionally, the KNVQA dataset, while designed to be robust and scalable, may still be subject to inherent biases or limitations in the human annotations. Ongoing efforts to diversify and expand multimodal datasets could further enhance the reliability and generalizability of LVLM evaluations.
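A standard way to probe the consistency of human annotations is to measure agreement between two annotators with Cohen's kappa, which corrects raw agreement for chance. The sketch below is a plain-Python version with hypothetical labels; the source does not say whether the authors report such a statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both annotators labeled independently at random
    # according to their own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two annotators on four answers.
annotator_1 = ["correct", "correct", "wrong", "wrong"]
annotator_2 = ["correct", "wrong", "wrong", "wrong"]
print(cohens_kappa(annotator_1, annotator_2))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, so a low kappa would suggest the annotation guidelines leave room for interpretation.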

Conclusion

This research highlights the critical need to address the issues of object hallucination and factual accuracy in large vision-language models (LVLMs). The proposed KNVQA-Eval framework and KNVQA dataset provide a valuable tool for comprehensively evaluating the multimodal capabilities of these models.

By focusing on knowledge-based visual question answering, the researchers can gain deeper insights into the factual accuracy of LVLMs, which is essential for their practical deployment in real-world applications. The findings from this work can guide future efforts to optimize and refine these powerful AI systems, ultimately enhancing their reliability and usefulness across a wide range of domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this paper, we introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark is collected from 73 different medical datasets, including 12 different modalities and covering more than 20 distinct anatomical regions. Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Through our extensive experiments, we have found that existing LVLMs struggle to address these medical VQA problems effectively. Moreover, what surprises us is that medical-specialized LVLMs even exhibit inferior performance to those general-domain models, calling for a more versatile and robust LVLM in the biomedical field. The evaluation results not only reveal the current limitations of LVLM in understanding real medical images but also highlight our dataset's significance. Our code with dataset are available at https://github.com/OpenGVLab/Multi-Modality-Arena.

4/23/2024

eess.IV cs.CV

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

Simon Ging, María A. Bravo, Thomas Brox

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which allows a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions given ground-truth answers. We perform a human evaluation study upon which we base our decision on the final metric. We apply our benchmark to a suite of vision-language models and show a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.

5/7/2024

cs.CV cs.CL cs.LG

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang

Evaluating and Rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely-used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore the visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, especially aimed at challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks and experimental results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (average gain by 5.0%). Ablation studies also verify the effectiveness of VKA and FKA, respectively. The codes are available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

6/27/2024

cs.CL cs.CV

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.

6/11/2024

cs.CV cs.AI cs.CL cs.LG