VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

2402.02655

Published 4/9/2024 by Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Abstract

This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content, and covers the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an obscure corner overlooked in Vietnamese research, the corpus provides a valuable resource for future research in reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our deep-learning models achieved the highest F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of EM, the highest score we achieved is 53.97%, which reflects the challenge in processing spoken-based content and highlights the need for further improvement.

Overview

  • This paper presents the development of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks.
  • The corpus, called VlogQA, consists of 10,076 question-answer pairs based on 1,230 transcript documents from YouTube videos on food and travel topics.
  • The authors aimed to capture the spoken language of native Vietnamese speakers in natural settings, which is often overlooked in Vietnamese research.
  • The paper also provides insights into the challenges and opportunities associated with using real-world data for MRC tasks.

Plain English Explanation

The researchers created a new dataset called VlogQA to help computers better understand Vietnamese as people actually speak it. Most existing datasets for training machines to comprehend Vietnamese are based on formal written sources like Wikipedia or news articles. In contrast, VlogQA uses transcripts of YouTube videos in which people speak naturally about topics like food and travel.

By capturing this more conversational, real-world Vietnamese language, the researchers hope to advance machine reading comprehension capabilities for the Vietnamese language. Their deep learning models achieved promising results, with an F1 score of 75.34% on the test set. However, they also note that processing spoken language content remains challenging, as reflected in the lower exact match (EM) score of 53.97%. This highlights the need for further research and improvements in this area.

Technical Explanation

The VlogQA corpus was developed by the researchers to address the lack of spoken language datasets for Vietnamese machine reading comprehension tasks. Existing MRC datasets in Vietnamese mainly focus on formal written sources like Wikipedia articles, online newspapers, or textbooks. In contrast, VlogQA uses transcripts from a diverse set of 1,230 YouTube videos covering food and travel topics, resulting in 10,076 question-answer pairs.
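To make the shape of such a corpus concrete, here is a minimal sketch of how one question-answer pair might be stored. It assumes a SQuAD-style span-extraction layout, which the summary above does not confirm; the class name QAExample, its field names, the make_example helper, and the sample text are all hypothetical and are not taken from the actual VlogQA release. The only figures from the source are the corpus sizes, used to compute the average number of questions per transcript.

```python
from dataclasses import dataclass

# Corpus sizes reported in the paper's abstract.
NUM_QA_PAIRS = 10_076
NUM_TRANSCRIPTS = 1_230

# Average number of question-answer pairs per transcript (about 8.19).
PAIRS_PER_TRANSCRIPT = NUM_QA_PAIRS / NUM_TRANSCRIPTS


@dataclass
class QAExample:
    """Hypothetical record layout; field names are assumptions."""
    transcript_id: str
    context: str        # transcript text of the video
    question: str
    answer_text: str    # answer copied verbatim from the context
    answer_start: int   # character offset of the answer in the context

    def answer_span_is_valid(self) -> bool:
        # The stored offset must point at the stored answer text.
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end] == self.answer_text


def make_example(transcript_id: str, context: str,
                 question: str, answer_text: str) -> QAExample:
    # str.index raises ValueError if the answer is not in the context,
    # which enforces the span-extraction assumption.
    start = context.index(answer_text)
    return QAExample(transcript_id, context, question, answer_text, start)


# Invented English sample, for illustration only.
example = make_example(
    "hypothetical-0001",
    "today we try the noodle soup at a small stall near the market",
    "What dish do they try?",
    "noodle soup",
)
print(example.answer_span_is_valid(), round(PAIRS_PER_TRANSCRIPT, 2))
```

Storing the character offset alongside the answer text is a common choice for extractive datasets because it disambiguates answers that occur more than once in a transcript.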

To evaluate the performance of MRC models on this spoken language data, the researchers trained several deep learning architectures, including TinyVQA and ComuniQA. The best-performing model achieved an F1 score of 75.34% on the test set, indicating significant progress in Vietnamese MRC for spoken language data. However, the exact match (EM) score of 53.97% reflects the challenges associated with processing conversational content, which the researchers suggest requires further research and improvement. Similar challenges appear in related work on expanding large language models for spoken language understanding and on enhancing Persian conversational question answering.
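The gap between the two reported metrics is easier to see with a small worked example. The sketch below follows the widely used SQuAD-style definitions of exact match and token-level F1; it assumes whitespace tokenization and a simple lowercase-and-strip-punctuation normalization, which may differ from the authors' actual evaluation script. The function names and the sample strings are illustrative.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 only if the normalized token sequences are identical."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (partial credit)."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that drops the last word of the gold answer.
gold_answer = "phở bò tái"
predicted = "phở bò"
em = exact_match(predicted, gold_answer)
f1 = f1_score(predicted, gold_answer)
print(em, round(f1, 2))
```

In this example the prediction misses one of three gold tokens, so exact match is 0 while F1 is 0.8 (precision 1, recall 2/3). Predictions of this kind raise the corpus-level F1 without raising EM, which is one way a model can reach an F1 well above its EM.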

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their work. While the VlogQA dataset provides a valuable resource for studying Vietnamese spoken language, it is limited to a specific domain (food and travel) and may not capture the full breadth of conversational topics. Additionally, the dataset is derived from YouTube transcripts, which can be noisy and may not perfectly represent natural spoken language.

The performance metrics, while promising, also highlight the difficulties in processing spoken language content for MRC tasks. Exact match credits a prediction only when it matches a reference answer exactly, whereas F1 gives partial credit for overlapping tokens. The gap between the EM score (53.97%) and the F1 score (75.34%) therefore suggests that the models often locate roughly the right answer but do not reproduce it exactly. Further research is needed to develop more robust and adaptable MRC models that can handle the challenges of spoken language, such as disfluencies, accents, and contextual references.

It would also be valuable for the researchers to explore the transferability of the models trained on VlogQA to other spoken language domains or tasks, as well as to compare the performance of their models to human benchmarks on the same dataset. This could provide additional insights into the current state of Vietnamese MRC and guide future research directions.

Conclusion

This paper presents the development of the VlogQA dataset, a valuable resource for advancing machine reading comprehension of Vietnamese spoken language. By using real-world data from YouTube transcripts, the researchers have captured the natural, conversational nature of the language, which is often overlooked in existing datasets.

The promising results from the deep learning models trained on VlogQA suggest significant progress in Vietnamese MRC, particularly for spoken language tasks. However, the paper also highlights the ongoing challenges in processing conversational content, which require further research and improvement.

Overall, the VlogQA dataset and the insights provided in this paper contribute to the growing body of work on enhancing natural language understanding capabilities, with potential applications in areas like voice-based assistants, transcription services, and educational tools for the Vietnamese language.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Initially, this task was researched, focusing on methods to help machines understand objects and scene contexts in images. However, some text appearing in the image that carries explicit information about the full content of the image is not mentioned. Along with the continuous development of the AI era, there have been many studies on the reading comprehension ability of VQA models in the world. As a developing country, conditions are still limited, and this task is still open in Vietnam. Therefore, we introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images, we call it ViTextVQA (Vietnamese Text-based Visual Question Answering dataset) which contains over 16,000 images and over 50,000 questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at https://github.com/minhquan6203/ViTextVQA-Dataset for research purposes.

4/17/2024

ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering text information contained in images that have just been significantly developed in the English language in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. In this dataset, all the images contain text and questions about the information relevant to the text in the images. We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset. Furthermore, we introduce a novel approach, called VisionReader, which achieved 0.4116 in EM and 0.6990 in the F1-score on the test set. Through the results, we found that the OCR system plays a very important role in VQA models on the ViOCRVQA dataset. In addition, the objects in the image also play a role in improving model performance. We open access to our dataset at link (https://github.com/qhnhynmm/ViOCRVQA.git) for further research in OCR-VQA task in Vietnamese.

4/30/2024

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang

While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA, witnessing significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.

4/19/2024

Vietnamese AI Generated Text Detection

Quang-Dan Tran, Van-Quan Nguyen, Quang-Huy Pham, K. B. Thang Nguyen, Trong-Hop Do

In recent years, Large Language Models (LLMs) have become integrated into our daily lives, serving as invaluable assistants in completing tasks. Widely embraced by users, the abuse of LLMs is inevitable, particularly in using them to generate text content for various purposes, leading to difficulties in distinguishing between text generated by LLMs and that written by humans. In this study, we present a dataset named ViDetect, comprising 6,800 samples of Vietnamese essays, with 3,400 samples authored by humans and the remainder generated by LLMs, serving the purpose of detecting text generated by AI. We conducted evaluations using state-of-the-art methods, including ViT5, BartPho, PhoBERT, mDeberta V3, and mBERT. These results contribute not only to the growing body of research on detecting text generated by AI but also demonstrate the adaptability and effectiveness of different methods in the Vietnamese language context. This research lays the foundation for future advancements in AI-generated text detection and provides valuable insights for researchers in the field of natural language processing.

5/7/2024