NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

Read original: arXiv:2308.09768 - Published 5/21/2024 by Anuoluwapo Aremu, Jesujoba O. Alabi, Daud Abolade, Nkechinyere F. Aguobi, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani

Overview

The researchers create a new multi-choice Reading Comprehension dataset called NaijaRC for three native Nigerian languages based on high-school reading comprehension exams.
They provide baseline results through cross-lingual transfer: a pre-trained encoder-only model is fine-tuned on the existing English RACE and Belebele training data.
Additionally, they provide results by prompting large language models (LLMs) like GPT-4.

Plain English Explanation

The researchers have developed a new dataset called NaijaRC that contains reading comprehension questions and answers in three native Nigerian languages. This dataset is based on high-school level reading exams in these languages.

To establish a baseline for how well machine learning models perform on this new dataset, the researchers fine-tuned a pre-trained encoder-only model on the English training data of two existing reading comprehension datasets, RACE and Belebele. The model was then tested on the Nigerian-language questions, so it had to transfer what it learned from English.

The researchers also tested how well large language models (LLMs) like GPT-4 could perform on the NaijaRC dataset when prompted to answer the reading comprehension questions.

Technical Explanation

The researchers created the NaijaRC dataset, which contains multi-choice reading comprehension questions and answers in three native Nigerian languages: Hausa, Igbo, and Yoruba. They sourced the questions and passages from high-school level reading comprehension exams in these languages.
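
A multi-choice reading comprehension example of the kind described above can be pictured as a small record. The sketch below is only an illustration: the class name, field names, language code, and the choice of four options are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultiChoiceItem:
    """Hypothetical record for one multi-choice reading comprehension question."""

    language: str              # assumed language tag, e.g. "yor" for Yoruba
    passage: str               # the text the question is about
    question: str
    options: Tuple[str, ...]   # candidate answers
    answer_index: int          # position of the correct option

    def __post_init__(self) -> None:
        # Reject records whose gold label does not point at an option.
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index must point at one of the options")

    @property
    def answer(self) -> str:
        return self.options[self.answer_index]


# Placeholder content only; no real dataset text is reproduced here.
item = MultiChoiceItem(
    language="yor",
    passage="(passage text)",
    question="(question text)",
    options=("option A", "option B", "option C", "option D"),
    answer_index=2,
)
print(item.answer)
```

Keeping the gold answer as an index rather than a string makes it easy to compare against a model's predicted option position.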

To establish baseline performance, the researchers used cross-lingual transfer learning. They fine-tuned a pre-trained encoder-only model (a BERT-style architecture) on the English RACE and Belebele training data, then applied it to the Nigerian-language questions. The model therefore has to carry over what it learned from English supervision to languages it was not fine-tuned on.
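
One common way to feed a multi-choice question to an encoder-only model is to build one input pair per option, score each pair, and pick the highest score. The sketch below shows that selection logic only. The function names are hypothetical, and `overlap_score` is a toy word-overlap stand-in for a fine-tuned encoder, used so the example runs without any model; it is not the authors' method.

```python
from typing import Callable, List, Sequence, Tuple


def build_choice_pairs(
    passage: str, question: str, options: Sequence[str]
) -> List[Tuple[str, str]]:
    """Pair the passage with 'question + option', one pair per option."""
    return [(passage, f"{question} {option}") for option in options]


def predict(
    pairs: Sequence[Tuple[str, str]],
    score_pair: Callable[[str, str], float],
) -> int:
    """Return the index of the highest-scoring pair (first one on ties)."""
    scores = [score_pair(context, candidate) for context, candidate in pairs]
    return max(range(len(scores)), key=scores.__getitem__)


def overlap_score(context: str, candidate: str) -> float:
    """Toy scorer (assumption): count candidate words that occur in the context."""
    context_words = set(context.lower().split())
    return float(sum(1 for w in candidate.lower().split() if w in context_words))


# Made-up English example, for illustration only.
passage = "the market opens early every saturday morning"
question = "when does the market open"
options = ["every sunday night", "every saturday morning", "once a year", "never"]

pairs = build_choice_pairs(passage, question, options)
predicted = predict(pairs, overlap_score)
print(options[predicted])
```

In a cross-lingual setup, the scorer would be trained on English pairs and then called unchanged on pairs built from Hausa, Igbo, or Yoruba passages.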

Additionally, the researchers prompted large language models (LLMs) like GPT-4 to answer the NaijaRC questions directly, without any fine-tuning. This gives a sense of how capable such models are at reading comprehension in these three languages.
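
Prompt-based evaluation of this kind usually involves three steps: formatting the item as a prompt, extracting the chosen option from the model's reply, and computing accuracy. The sketch below is a minimal illustration under assumptions: the prompt wording, the letter labels, and all function names are hypothetical, and no model API is called. The replies are passed in as plain strings.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCD"  # assumed option labels


def build_prompt(passage: str, question: str, options: Sequence[str]) -> str:
    """Format one multi-choice item as a plain-text prompt."""
    lines = [f"Passage: {passage}", f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_answer(reply: str, n_options: int) -> Optional[int]:
    """Return the index of the first standalone capital option letter, if any.

    This is a simple heuristic: lowercase letters are ignored, and a reply
    that starts with the English article "A" would be read as option A.
    """
    if not 1 <= n_options <= len(LETTERS):
        raise ValueError("n_options must be between 1 and 4")
    match = re.search(rf"\b([{LETTERS[:n_options]}])\b", reply)
    return LETTERS.index(match.group(1)) if match else None


def accuracy(predictions: Sequence[Optional[int]], gold: Sequence[int]) -> float:
    """Fraction of items answered correctly; unparsed replies count as wrong."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


prompt = build_prompt("(passage text)", "(question text)", ["w", "x", "y", "z"])
replies = ["The answer is B.", "no idea", "C"]   # made-up model replies
predictions = [parse_answer(reply, 4) for reply in replies]
print(accuracy(predictions, [1, 0, 3]))
```

Counting unparseable replies as wrong keeps the accuracy comparable with a classifier that always outputs one of the options.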

Critical Analysis

The researchers acknowledge that their dataset, NaijaRC, is relatively small compared to widely used English reading comprehension datasets. This may limit the ability of machine learning models to fully generalize and perform well on this new dataset.

Furthermore, the cross-lingual transfer learning approach relies on the assumption that the English datasets used for fine-tuning are similar enough in structure and content to the Nigerian language datasets. This may not always be the case, and could impact the performance of the transferred model.

The researchers did not provide extensive analysis on the types of errors made by the models or the specific challenges presented by the Nigerian language datasets. Further research into these areas could provide valuable insights for improving reading comprehension in low-resource languages.

Conclusion

The researchers have made a valuable contribution by creating the NaijaRC dataset, which can help drive progress in reading comprehension for three native Nigerian languages. Their use of cross-lingual transfer learning and prompting of large language models provides a solid baseline for future work in this area.

As the field of multilingual natural language processing continues to advance, datasets like NaijaRC will become increasingly important for ensuring that the benefits of these technologies are accessible to speakers of diverse languages, particularly those that have been historically underrepresented in AI research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

Anuoluwapo Aremu, Jesujoba O. Alabi, Daud Abolade, Nkechinyere F. Aguobi, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani

In this paper, we create NaijaRC: a new multi-choice Reading Comprehension dataset for three native Nigeria languages that is based on high-school reading comprehension examination. We provide baseline results by performing cross-lingual transfer using existing English RACE and Belebele training dataset based on a pre-trained encoder-only model. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.

5/21/2024

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

7/26/2024

Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Teresa Lynn, Malik H. Altakrori, Samar Mohamed Magdy, Rocktim Jyoti Das, Chenyang Lyu, Mohamed Nasr, Younes Samih, Alham Fikri Aji, Preslav Nakov, Shantanu Godbole, Salim Roukos, Radu Florian, Nizar Habash

The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share our insights from the process, which we hope will contribute to a deeper understanding of the challenges and the opportunities associated with task reformulation in NLP research.

4/29/2024

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

Md. Arid Hasan, Maram Hasanain, Fatema Ahmad, Sahinur Rahman Laskar, Sunaya Upadhyay, Vrunda N Sukhadia, Mucahid Kutlu, Shammur Absar Chowdhury, Firoj Alam

Natural Question Answering (QA) datasets play a crucial role in developing and evaluating the capabilities of large language models (LLMs), ensuring their effective usage in real-world applications. Despite the numerous QA datasets that have been developed, there is a notable lack of region-specific datasets generated by native users in their own languages. This gap hinders the effective benchmarking of LLMs for regional and cultural specificities. In this study, we propose a scalable framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages, for LLM evaluation and tuning. Moreover, to demonstrate the efficacy of the proposed framework, we designed a multilingual natural QA dataset, MultiNativQA, consisting of ~72K QA pairs in seven languages, ranging from high to extremely low resource, based on queries from native speakers covering 18 topics. We benchmark the MultiNativQA dataset with open- and closed-source LLMs. We made both the framework NativQA and MultiNativQA dataset publicly available for the community. (https://nativqa.gitlab.io)

7/16/2024