ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

2404.18397

Published 4/30/2024 by Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

cs.CV

Abstract

Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering text information contained in images that have just been significantly developed in the English language in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. In this dataset, all the images contain text and questions about the information relevant to the text in the images. We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieved 0.4116 in EM and 0.6990 in the F1-score on the test set. Through the results, we found that the OCR system plays a very important role in VQA models on the ViOCRVQA dataset. In addition, the objects in the image also play a role in improving model performance. We open access to our dataset at link (https://github.com/qhnhynmm/ViOCRVQA.git) for further research in OCR-VQA task in Vietnamese.

Overview

This paper introduces a new dataset called ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering) for the task of answering questions about text information contained in images, specifically for the Vietnamese language.
The dataset contains over 28,000 images and 120,000 question-answer pairs, with the images containing text that the questions are based on.
The authors deploy state-of-the-art methods proposed for English to conduct experiments on their Vietnamese dataset, revealing the challenges and difficulties in working with a low-resource language like Vietnamese.
They also introduce a novel approach called VisionReader, which achieves an Exact Match (EM) score of 0.4116 and an F1-score of 0.6990 on the ViOCRVQA test set.

Plain English Explanation

The paper focuses on a task called Optical Character Recognition - Visual Question Answering (OCR-VQA), which involves answering questions about text information contained in images. This task has been well-studied for the English language, but there has been limited research for low-resource languages like Vietnamese.

To address this gap, the researchers created a new dataset called ViOCRVQA, which contains over 28,000 images and 120,000 question-answer pairs. The images in this dataset all have text, and the questions are about the information in that text.

The researchers then tried applying state-of-the-art methods developed for English OCR-VQA to their Vietnamese dataset. This helped them understand the unique challenges and difficulties involved in working with a low-resource language like Vietnamese.

To tackle these challenges, the researchers developed a new approach called VisionReader, which performed well on the ViOCRVQA dataset. Their results showed that the OCR (Optical Character Recognition) system plays a crucial role in the performance of VQA (Visual Question Answering) models on this Vietnamese dataset. Additionally, the objects in the images also contribute to improving the model's performance.

The researchers have made the ViOCRVQA dataset publicly available to encourage further research on OCR-VQA in Vietnamese.

Technical Explanation

The authors introduce a novel dataset called ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering) for the task of answering questions about text information contained in images. The dataset consists of over 28,000 images and 120,000 question-answer pairs, where the images all contain text, and the questions are based on the information in that text.

To evaluate the performance of existing methods on this Vietnamese dataset, the authors deploy ideas from state-of-the-art approaches proposed for English OCR-VQA tasks. This allows them to understand the unique challenges and difficulties inherent in working with a low-resource language like Vietnamese.

Furthermore, the authors introduce a novel approach called VisionReader, which achieved an Exact Match (EM) score of 0.4116 and an F1-score of 0.6990 on the ViOCRVQA test set. The gap between the two metrics indicates that many predictions overlap with the reference answer without matching it exactly. Their analysis reveals that the OCR system plays a crucial role in the performance of VQA models on this Vietnamese dataset. They also find that the objects in the image contribute to improving the model's performance.
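The two reported metrics can be illustrated with a minimal sketch. It assumes the common reading-comprehension definitions: EM checks whether the normalized prediction equals the reference, and F1 is the harmonic mean of token-level precision and recall. The normalization used here (Unicode NFC, lowercasing, whitespace tokenization) and the function names are assumptions made for illustration; the paper's own evaluation script may differ.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> list[str]:
    """Assumed normalization: NFC-compose Vietnamese diacritics, lowercase, split on whitespace."""
    return unicodedata.normalize("NFC", text).lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Average EM and F1 over (prediction, reference) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / n
    f1 = sum(f1_score(p, r) for p, r in pairs) / n
    return em, f1
```

Under these definitions a prediction that shares three of its four tokens with a five-token reference scores 0 on EM but about 0.67 on F1, which is why F1 is never lower than EM for the same predictions.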

The authors have made the ViOCRVQA dataset publicly available at https://github.com/qhnhynmm/ViOCRVQA.git to encourage further research on OCR-VQA in Vietnamese.

Critical Analysis

The paper introduces a valuable dataset and approach for the task of OCR-VQA in Vietnamese, a low-resource language. By deploying state-of-the-art methods developed for English and introducing a novel VisionReader approach, the authors have made important strides in addressing the challenges of this task for Vietnamese.

However, the paper does not provide a detailed analysis of the limitations of the VisionReader approach or the ViOCRVQA dataset. It would be helpful to understand the types of errors or biases in the dataset, as well as the potential weaknesses of the VisionReader model that could be addressed in future work.

Additionally, the paper could have positioned VisionReader against related work such as VlogQA or PDF-MVQA. These benchmarks do not target Vietnamese OCR-VQA directly (VlogQA, for example, covers Vietnamese spoken-based machine reading comprehension), so the comparison would be indirect, but it would still give readers a better sense of where VisionReader stands.

Furthermore, the paper could have discussed the potential applications and societal implications of a robust OCR-VQA system for Vietnamese, and how this research could be extended to other low-resource languages. Enhancing Visual Question Answering through Question-Driven Attention and Fusion of Domain-Adapted Vision-Language Models for Medical Image Understanding are examples of related work that could provide inspiration for future research directions.

Conclusion

This paper presents a valuable contribution to the field of OCR-VQA for low-resource languages, specifically Vietnamese. By introducing the ViOCRVQA dataset and the VisionReader approach, the authors have taken important steps towards addressing the challenges in this task for Vietnamese.

The results highlight the crucial role of the OCR system and the importance of object information in improving the performance of VQA models on the ViOCRVQA dataset. The public release of the dataset may spur further research on OCR-VQA in Vietnamese and could inform similar efforts for other low-resource languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Initially, this task was researched, focusing on methods to help machines understand objects and scene contexts in images. However, some text appearing in the image that carries explicit information about the full content of the image is not mentioned. Along with the continuous development of the AI era, there have been many studies on the reading comprehension ability of VQA models in the world. As a developing country, conditions are still limited, and this task is still open in Vietnam. Therefore, we introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images, we call it ViTextVQA (Vietnamese Text-based Visual Question Answering dataset) which contains over 16,000 images and over 50,000 questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at this link (https://github.com/minhquan6203/ViTextVQA-Dataset) for research purposes.

4/17/2024

cs.CL

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.

6/11/2024

cs.CV cs.AI cs.CL cs.LG

VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, the VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube -- an extensive source of user-uploaded content, covering the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an obscure corner overlooked in Vietnamese research, the corpus provides a valuable resource for future research in reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our deep-learning models achieved the highest F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of EM, the highest score we accomplished is 53.97%, which reflects the challenge in processing spoken-based content and highlights the need for further improvement.

4/9/2024

cs.CL

Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering

Hiba Maryam, Ling Fu, Jiajun Song, Tajrian ABM Shafayet, Qidi Luo, Xiang Bai, Yuliang Liu

The development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies is crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content, facilitating better understanding and interaction with Urdu-language visual data. This initiative seeks to bridge the gap between textual and visual comprehension. We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images, which can be used for text detection, recognition, and VQA tasks. We provide fine-grained annotations for text instances, addressing the limitations of previous datasets for facing arbitrary-shaped texts. By incorporating additional annotation points, this dataset facilitates the development and assessment of methods that can handle diverse text layouts, intricate shapes, and non-standard orientations commonly encountered in real-world scenarios. Besides, the VQA annotations make it the first benchmark for the Urdu Text VQA method, which can prompt the development of Urdu scene text understanding. The proposed dataset is available at: https://github.com/Hiba-MeiRuan/Urdu-VQA-Dataset-/tree/main

5/22/2024

cs.CV