ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

arXiv: 2404.10652

Published 4/17/2024 by Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

cs.CL

Abstract

Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Initially, this task was researched, focusing on methods to help machines understand objects and scene contexts in images. However, some text appearing in the image that carries explicit information about the full content of the image is not mentioned. Along with the continuous development of the AI era, there have been many studies on the reading comprehension ability of VQA models in the world. As a developing country, conditions are still limited, and this task is still open in Vietnam. Therefore, we introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images, we call it ViTextVQA (Vietnamese Text-based Visual Question Answering dataset) which contains over 16,000 images and over 50,000 questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at https://github.com/minhquan6203/ViTextVQA-Dataset for research purposes.

Overview

• This paper introduces ViTextVQA, a large-scale visual question answering (VQA) dataset for evaluating Vietnamese text comprehension in images.

• The dataset contains over 50,000 questions with answers about Vietnamese text in over 16,000 images, covering a diverse range of topics and scenarios.

• The goal of ViTextVQA is to advance research on multimodal understanding and text-based visual reasoning for the Vietnamese language.

Plain English Explanation

ViTextVQA is a new dataset that aims to improve how computers understand and answer questions about Vietnamese text in images. The dataset contains over 50,000 questions about Vietnamese words, signs, and other text that appear in more than 16,000 pictures. These questions cover a wide variety of topics, like navigation, shopping, and daily life.

The key idea behind ViTextVQA is to push the boundaries of visual question answering, the ability of AI systems to analyze an image and answer questions about it. By focusing on Vietnamese text, ViTextVQA provides a new challenge that goes beyond recognizing objects in an image. It also requires the AI to comprehend the meaning and context of the Vietnamese words and phrases.

Developing strong Vietnamese VQA models could lead to better interfaces for Vietnamese-speaking users, such as more accurate translation apps or voice assistants that can understand Vietnamese text in the real world. It could also contribute to advances in multimodal AI that can fluidly combine visual and textual information.

Technical Explanation

ViTextVQA is a large-scale visual question answering (VQA) dataset focused on evaluating Vietnamese text comprehension in images. The dataset contains over 50,000 questions with answers about Vietnamese words, signs, and other text that appear in over 16,000 images.

The questions in ViTextVQA cover a wide range of topics, including navigation, shopping, education, and daily life. Each question is accompanied by a ground-truth answer, as well as additional metadata like the type of Vietnamese text in the image and the difficulty level of the question.

The authors propose ViTextVQA as a benchmark for assessing the ability of AI systems to perform text-based visual reasoning for the Vietnamese language. Compared to existing VQA datasets, ViTextVQA introduces new challenges such as handling diverse Vietnamese scripts, understanding contextual meanings, and dealing with abstention when the visual information alone is insufficient to answer a question.

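The authors also evaluate several state-of-the-art models as baselines for ViTextVQA, including transformer-based architectures fine-tuned on the dataset. These experiments show that the order in which OCR tokens are processed and selected to form an answer matters, and the authors report that exploiting this finding significantly improved baseline performance. Even so, the results leave significant room for improvement.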
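The abstract reports that the order in which OCR tokens are processed affects answer quality, but this summary does not say which ordering the authors use. As an illustration only, the sketch below shows one common heuristic: group tokens into lines by vertical position, then read each line left to right. The `OcrToken` class, the `reading_order` function, and the `line_tolerance` parameter are hypothetical names introduced here and do not come from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrToken:
    """Hypothetical OCR token: text plus the top-left corner and height of its box."""
    text: str
    x: float
    y: float
    height: float


def reading_order(tokens: List[OcrToken], line_tolerance: float = 0.5) -> List[OcrToken]:
    """Sort tokens top-to-bottom, then left-to-right within each detected line.

    This is an assumed heuristic, not the ordering strategy from the paper.
    """
    if not tokens:
        return []

    def centre(tok: OcrToken) -> float:
        return tok.y + tok.height / 2

    # Walk tokens from top to bottom and start a new line whenever a token's
    # vertical centre is too far from the first token of the current line.
    by_y = sorted(tokens, key=centre)
    lines = [[by_y[0]]]
    for tok in by_y[1:]:
        ref = lines[-1][0]
        if abs(centre(tok) - centre(ref)) <= line_tolerance * ref.height:
            lines[-1].append(tok)
        else:
            lines.append([tok])

    # Within each line, read left to right.
    ordered: List[OcrToken] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda t: t.x))
    return ordered


def ocr_context(tokens: List[OcrToken]) -> str:
    """Join the ordered tokens into a single string a VQA model could consume."""
    return " ".join(t.text for t in reading_order(tokens))
```

For example, a sign detected out of order as `35.000đ`, `BÒ`, `PHỞ` would be passed to `ocr_context` as three `OcrToken` objects, with the first two words sharing a line above the price, and come back as a single string in reading order. A fixed tolerance like this is fragile for rotated or curved text, so a real pipeline would likely need something more robust.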

Critical Analysis

ViTextVQA represents an important step forward in multimodal AI research, as it focuses on a language and modality that have been underexplored compared to English and other widely studied languages. The diversity of the dataset, in terms of both visual content and question types, is a strength that should stimulate innovative approaches.

However, the dataset is limited to a single language, Vietnamese, which may restrict its broader applicability. Expanding ViTextVQA to include other languages, or developing similar datasets for other underrepresented languages, could further advance the field of multilingual and cross-cultural visual understanding.

Additionally, the authors note that the dataset may exhibit biases, as the images and questions were collected from online sources. Careful analysis and mitigation of such biases will be crucial to ensure the fair and ethical development of VQA systems.

Finally, the relatively low performance of the baseline models suggests that ViTextVQA poses significant challenges that current AI techniques have yet to master. Continued research and innovation will be needed to push the boundaries of text-based visual reasoning, particularly for less resourced languages like Vietnamese.

Conclusion

ViTextVQA is a valuable new dataset that aims to advance research on multimodal understanding and text-based visual reasoning for the Vietnamese language. By providing a large-scale benchmark with diverse questions and images, the dataset presents new challenges that go beyond traditional object recognition tasks.

The development of effective ViTextVQA models could lead to improved user experiences for Vietnamese-speaking audiences, such as more accurate translation apps or voice assistants that can understand Vietnamese text in the real world. More broadly, this research could contribute to the development of more versatile and capable multimodal AI systems that can fluidly combine visual and textual information.

While the current baseline results suggest significant room for improvement, the ViTextVQA dataset represents an important step forward in the field of multilingual and cross-cultural visual understanding. Continued research and innovation in this area could yield valuable insights and AI capabilities that benefit diverse communities around the world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering text information contained in images that have just been significantly developed in the English language in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. In this dataset, all the images contain text and questions about the information relevant to the text in the images. We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieved 0.4116 in EM and 0.6990 in the F1-score on the test set. Through the results, we found that the OCR system plays a very important role in VQA models on the ViOCRVQA dataset. In addition, the objects in the image also play a role in improving model performance. We open access to our dataset at link (https://github.com/qhnhynmm/ViOCRVQA.git) for further research in OCR-VQA task in Vietnamese.

4/30/2024

cs.CV

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.

6/11/2024

cs.CV cs.AI cs.CL cs.LG

VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, the VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube -- an extensive source of user-uploaded content, covering the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an obscure corner overlooked in Vietnamese research, the corpus provides a valuable resource for future research in reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our deep-learning models achieved the highest F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of EM, the highest score we accomplished is 53.97%, which reflects the challenge in processing spoken-based content and highlights the need for further improvement.

4/9/2024

cs.CL

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang

Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial visual-textual misalignment problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset, it is evident that there is still a large room for performance improvement, underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.

6/12/2024

cs.CV