KazQAD: Kazakh Open-Domain Question Answering Dataset

2404.04487

Published 4/9/2024 by Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov, Ardak Shalkarbayuli, Pavel Braslavski

Abstract

We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there should still be ample room for improvement. We also show that the current OpenAI's ChatGPTv3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.

Overview

This paper introduces KazQAD, a new open-domain question answering dataset for the Kazakh language.
The dataset was created to address the lack of resources for Kazakh language AI research and applications.
KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements.
The authors also provide baseline models and evaluation results to establish benchmarks for future research.

Plain English Explanation

KazQAD: Kazakh Open-Domain Question Answering Dataset is a new dataset that aims to advance Kazakh language AI. Kazakh is a Turkic language spoken in Central Asia, but there has been limited research and development of AI systems that can understand and respond to Kazakh language content.

To address this gap, the researchers created KazQAD, a collection of just under 6,000 unique questions with extracted short answers, plus nearly 12,000 passage-level relevance judgements. The questions come from two sources: machine-translated items from the Natural Questions (NQ) dataset, used only for training, and original Kazakh Unified National Testing (UNT) exam questions, used for development and testing. This dataset provides a valuable resource for training and evaluating Kazakh language question answering models.

The authors also provide baseline models and evaluation results to help other researchers build upon this work. By establishing benchmarks for Kazakh question answering, the hope is to spur more innovation and progress in this important area of Kazakh language AI.

Technical Explanation

The KazQAD dataset was created to advance open-domain question answering (ODQA) for the Kazakh language. It can be used in reading comprehension and full ODQA settings, as well as for information retrieval experiments. It contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements.

To build the dataset, the researchers combined machine translation, Wikipedia search, and in-house manual annotation, a mix chosen for annotation efficiency and data quality. The questions come from two sources: items from the Natural Questions (NQ) dataset translated into Kazakh, used only for training, and original Kazakh Unified National Testing (UNT) exam questions, used for development and testing.

The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. Passages were judged for relevance to the questions, and short answers were extracted for each question. As a supplementary resource, the authors release around 61,000 question-passage-answer triples from NQ that were machine-translated into Kazakh. The dataset is publicly available under a CC BY-SA licence to enable further research.
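The abstract reports retrieval quality as NDCG@10 and MRR, both computed from passage-level relevance judgements. The following is a minimal sketch of how such scores might be computed for one query, not the paper's evaluation code. The passage IDs, the `relevance` dictionary layout, and the use of linear gain are assumptions made for illustration.

```python
import math
from typing import Dict, List


def reciprocal_rank(ranked_ids: List[str], relevance: Dict[str, int]) -> float:
    """Return 1/rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if relevance.get(pid, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gain (an assumption; some tools use 2**rel - 1)."""
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal ranking: all judged passages sorted by decreasing relevance.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values: List[float]) -> float:
    """Average per-query scores; MRR is the mean of reciprocal ranks."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical judgements for one question: passages p1 and p9 are relevant.
relevance = {"p1": 1, "p9": 1}
ranked = ["p3", "p1", "p7"]  # hypothetical retriever output

rr = reciprocal_rank(ranked, relevance)
ndcg = ndcg_at_k(ranked, relevance, k=10)
```

With binary judgements, linear and exponential gain give the same NDCG. Averaging these per-query values over all test questions yields the corpus-level MRR and NDCG@10.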

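In addition to the dataset, the paper provides baseline retrievers and readers with evaluation results. The baselines reach NDCG@10 = 0.389 and MRR = 0.382 in retrieval, EM = 38.5 and F1 = 54.2 in reading comprehension, and EM = 17.8 and F1 = 28.7 in the full ODQA setting. The authors note that these scores are substantially lower than state-of-the-art results on English QA collections, and that ChatGPT v3.5 could not answer the KazQAD test questions with acceptable quality in the closed-book setting. These baselines can serve as a starting point for other researchers working on Kazakh QA.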
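The abstract reports reader quality as exact match (EM) and F1. The sketch below shows one common way to compute these for short extracted answers, in the style of SQuAD evaluation. It is an illustration under stated assumptions, not the paper's script: the normalization (lowercasing, ASCII punctuation stripping, whitespace tokenization) and the example strings are assumptions, and the English article removal used in SQuAD is omitted because it does not apply to Kazakh.

```python
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, gold_answers: List[str]) -> float:
    """Score against the closest gold answer, as is usual with multiple references."""
    return max(token_f1(prediction, g) for g in gold_answers)


# Hypothetical example: the prediction contains one extra token.
gold = ["Алматы"]
em = exact_match("Алматы қаласы", gold)
f1 = best_f1("Алматы қаласы", gold)
```

Here the extra token makes EM zero while F1 still gives partial credit, which is why F1 is typically higher than EM in reported results.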

Critical Analysis

The KazQAD dataset represents an important step forward for Kazakh language AI research, but there are a few limitations to consider. First, while the dataset covers a wide range of topics, it may not be fully representative of all Kazakh language use cases. The training questions are machine-translated from the English NQ dataset, while the development and test questions come from the UNT exam, so the types of questions and answers may differ between training and evaluation.

Additionally, the baseline models provided in the paper, while useful, may not reflect the full potential of Kazakh QA systems. The reported scores are substantially lower than state-of-the-art results for English QA collections, and the authors themselves see ample room for improvement. Techniques such as multilingual pretraining or data augmentation could potentially yield better results.

Further research is also needed to understand how well KazQAD-trained models would perform on real-world Kazakh language tasks beyond just question answering. Deployment in practical applications would require additional testing and validation.

Despite these limitations, the KazQAD dataset represents an important contribution that can spur more research and development in Kazakh language AI. By providing a high-quality benchmark, the authors have laid the groundwork for continued advancements in this critical area.

Conclusion

The KazQAD: Kazakh Open-Domain Question Answering Dataset paper introduces a valuable new resource for Kazakh language AI research. By creating an openly licensed dataset of Kazakh questions, short answers, and passage-level relevance judgements, the authors have enabled the development of more capable Kazakh question answering systems.

The baseline models and evaluation results provided in the paper can serve as a starting point for future work, while the open-source release of the dataset itself will allow researchers to build upon these foundations. As Kazakh language AI continues to advance, resources like KazQAD will play a crucial role in driving progress and unlocking new applications that can benefit Kazakh-speaking communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
