CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

2406.05967

Published 6/11/2024 by David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja and 65 others

cs.CV cs.AI cs.CL cs.LG

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

Abstract

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.

Create account to get full access

Overview

This paper introduces CVQA, a new culturally-diverse and multilingual visual question answering (VQA) benchmark.
CVQA aims to address the lack of diversity in existing VQA datasets, which predominantly feature images and questions from Western/English-speaking cultures.
The dataset contains over 130,000 image-question-answer triplets across 6 languages and multiple cultural backgrounds.

Plain English Explanation

The researchers who created this paper recognized that most existing visual question answering (VQA) datasets only include images and questions from Western or English-speaking cultures. To address this lack of diversity, they developed a new VQA benchmark called CVQA.

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark includes over 130,000 image-question-answer triplets spanning 6 different languages and multiple cultural backgrounds. The goal is to create a more inclusive and representative VQA dataset that can help train AI systems to understand and answer questions from a wider range of cultural perspectives.

This is an important step forward, as existing VQA datasets can lead to biases and blind spots in the AI models trained on them. By introducing more cultural diversity, the CVQA benchmark aims to push the field of VQA towards being more inclusive and representative of the global population.

Technical Explanation

The CVQA dataset was collected by crowd-sourcing image-question-answer triplets from diverse cultural backgrounds. They focused on 6 target languages - English, Chinese, Hindi, Arabic, Spanish, and French - and recruited annotators from corresponding cultural regions.

The images were carefully selected to depict a wide range of cultural scenes, objects, and activities. The annotators then wrote questions about these images that reflected their cultural perspectives and knowledge. Finally, they provided answers to the questions, creating the full image-question-answer triplets.

The resulting CVQA dataset contains over 130,000 such triplets, significantly larger than previous culturally-diverse VQA benchmarks. The researchers also provide baseline performance of state-of-the-art VQA models on this new dataset, showing room for improvement in handling the increased cultural diversity.

Critical Analysis

The CVQA dataset represents an important step towards more inclusive and representative visual question answering benchmarks. By incorporating diverse cultural perspectives, the dataset can help push the field of VQA to be more globally aware and less biased towards Western norms.

However, the dataset is still limited to 6 target languages, and may not fully capture the extreme diversity of the world's cultures and languages. There is room for even broader representation in future iterations of the benchmark.

Additionally, the performance of current VQA models on CVQA is still relatively low, suggesting significant challenges in generalizing to culturally diverse data. More research is needed to develop VQA systems that can truly understand and reason about visual information from a global perspective.

Conclusion

The CVQA benchmark introduced in this paper is a valuable contribution to the field of visual question answering. By incorporating cultural diversity across languages and backgrounds, it aims to make VQA systems more inclusive and representative of the global population.

While not perfect, CVQA represents an important step forward in pushing the VQA field to be more aware of its biases and limitations. Continued research and development on this benchmark has the potential to lead to AI systems that can understand and engage with visual information from a truly international and cross-cultural perspective.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang

Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial visual-textual misalignment problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset, it is evident that there is still a large room for performance improvement, underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.

6/12/2024

cs.CV

Towards Multilingual Audio-Visual Question Answering

Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English and replicating it for addressing AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This prevents extra human annotation efforts of collecting questions and answers manually. To this end, we propose, MERA framework, by leveraging state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models namely MERA-L, MERA-C, MERA-T with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

6/14/2024

cs.LG cs.CV cs.MM cs.SD eess.AS

🤔

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

Yoonsik Kim, Moonbin Yim, Ka Yeon Song

In this paper, we establish a benchmark for table visual question answering, referred to as the TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either through the application of a textit{stylesheet} or by employing the proposed table rendering system. QA pairs are generated by exploiting the large language model (LLM) where the input is a text-formatted table. Ultimately, the completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. GPT-4V achieves the highest accuracy among commercial and open-sourced MLLMs from our experiments. Moreover, we discover that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs in comparison to their LLM backbones, we investigate by presenting image-formatted tables to MLLMs and text-formatted tables to LLMs, respectively. Our findings suggest that processing visual inputs is more challenging than text inputs, as evidenced by the lower performance of MLLMs, despite generally requiring higher computational costs than LLMs. The proposed TableVQA-Bench and evaluation codes are available at href{https://github.com/naver-ai/tablevqabench}{https://github.com/naver-ai/tablevqabench}.

5/1/2024

cs.CV cs.AI

📉

KNVQA: A Benchmark for evaluation knowledge-based VQA

Sirui Cheng, Siyu Zhang, Jiayi Wu, Muchen Lan

Within the multimodal field, large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. However, LVLMs are still plagued by the two critical issues of object hallucination and factual accuracy, which limit the practicality of LVLMs in different scenarios. Furthermore, previous evaluation methods focus more on the comprehension and reasoning of language content but lack a comprehensive evaluation of multimodal interactions, thereby resulting in potential limitations. To this end, we propose a novel KNVQA-Eval, which is devoted to knowledge-based VQA task evaluation to reflect the factuality of multimodal LVLMs. To ensure the robustness and scalability of the evaluation, we develop a new KNVQA dataset by incorporating human judgment and perception, aiming to evaluate the accuracy of standard answers relative to AI-generated answers in knowledge-based VQA. This work not only comprehensively evaluates the contextual information of LVLMs using reliable human annotations, but also further analyzes the fine-grained capabilities of current methods to reveal potential avenues for subsequent optimization of LVLMs-based estimators. Our proposed VQA-Eval and corresponding dataset KNVQA will facilitate the development of automatic evaluation tools with the advantages of low cost, privacy protection, and reproducibility. Our code will be released upon publication.

6/14/2024

cs.CV cs.AI