The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Read original: arXiv:2308.16884 - Published 7/26/2024 by Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

Overview

A new dataset called Belebele is introduced, covering multiple-choice machine reading comprehension (MRC) tasks in 122 language variants.
This dataset aims to expand the language coverage of natural language understanding (NLU) benchmarks, enabling the evaluation of text models in high-, medium-, and low-resource languages.
Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers.
The English dataset alone is challenging enough to test state-of-the-art language models.
The parallel nature of the dataset allows for direct comparison of model performance across all languages.
The dataset is used to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs).

Plain English Explanation

The researchers have created a new dataset called Belebele that is designed to test how well language models can understand text in a wide range of languages. The dataset includes over 120 different language variants, which is significantly more than previous benchmarks.

Each question in the dataset is based on a short passage of text, and the model has to choose the correct answer from four multiple-choice options. The questions were carefully crafted to differentiate between models with varying levels of general language comprehension. Even the English-only portion of the dataset is challenging enough to push the boundaries of the latest language models.

Because the dataset is fully parallel, meaning the same questions and passages are available in all 122 language variants, researchers can directly compare how well a model performs in each of them. The authors use it to evaluate the multilingual capabilities of two main types of language models: multilingual masked language models (MLMs) and large language models (LLMs).

The key finding is that although English-centric LLMs show significant cross-lingual transfer, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. The researchers also observe that larger vocabulary size and deliberate vocabulary construction correlate with better performance on low-resource languages.

Overall, this new Belebele dataset opens up new opportunities to thoroughly assess and analyze the multilingual natural language understanding capabilities of AI systems.

Technical Explanation

The Belebele dataset is a multiple-choice machine reading comprehension (MRC) dataset that covers 122 language variants. This significantly expands the language coverage compared to previous NLU benchmarks, allowing for the evaluation of text models in high-, medium-, and low-resource languages.

Each question in the dataset is based on a short passage from the Flores-200 dataset and presents the model with four multiple-choice answers to select from. The questions were carefully curated to differentiate between models with varying levels of general language understanding.
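
The item structure described above (one passage, one question, four candidate answers) can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's actual schema: the class name `MCQItem`, its field names, the A-D prompt layout, and the 1-based answer index are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice reading comprehension item (hypothetical schema)."""
    passage: str
    question: str
    answers: tuple[str, str, str, str]  # exactly four options
    correct: int  # 1-based index of the correct option (assumed convention)


def format_prompt(item: MCQItem) -> str:
    """Render the item as a plain-text prompt with options labeled A-D."""
    labels = "ABCD"
    options = "\n".join(f"{labels[i]}. {a}" for i, a in enumerate(item.answers))
    return f"Passage: {item.passage}\nQuestion: {item.question}\n{options}\nAnswer:"


def accuracy(items: list[MCQItem], predictions: list[int]) -> float:
    """Fraction of items whose predicted option index matches the gold index."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    hits = sum(1 for item, pred in zip(items, predictions) if pred == item.correct)
    return hits / len(items)


# With four options, uniform random guessing has an expected accuracy of 1/4.
CHANCE_ACCURACY = 1 / 4
```

Because every item has exactly four options, any reported accuracy is best read against the 25% chance level rather than against zero.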

The dataset is fully parallel, meaning the same passages and questions are available in all 122 languages. This enables direct comparison of model performance across the entire set of languages.
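
One way to picture what a parallel benchmark buys you: accuracy can be computed per language over the same set of questions, so any gap reflects the language rather than the question mix. The sketch below assumes that the gold answer for a question is shared across languages and is keyed by a hypothetical `question_id`; the function names and the `eng_Latn` reference code (a Flores-200 style identifier) are illustrative choices, not part of the paper.

```python
from collections import defaultdict


def per_language_accuracy(gold, predictions):
    """Compute accuracy per language over a shared set of questions.

    gold:        {question_id: correct_option}
    predictions: {(language, question_id): predicted_option}
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for (language, question_id), predicted in predictions.items():
        totals[language] += 1
        hits[language] += int(predicted == gold[question_id])
    return {language: hits[language] / totals[language] for language in totals}


def gap_to_reference(accuracies, reference="eng_Latn"):
    """Accuracy difference of each language relative to a reference language."""
    ref = accuracies[reference]
    return {
        language: acc - ref
        for language, acc in accuracies.items()
        if language != reference
    }
```

A negative gap means the model answers the same questions less accurately in that language than in the reference language.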

The researchers use the Belebele dataset to evaluate the multilingual capabilities of two key model types: multilingual masked language models (MLMs) and large language models (LLMs). They find that, despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages.

Additionally, the researchers observe that larger vocabulary size and deliberate vocabulary construction correlate with better performance on low-resource languages within the Belebele dataset.
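
A relationship of this kind is typically quantified with a correlation coefficient. The sketch below is a plain Pearson correlation written from its definition; it is not the paper's analysis code, and the intended inputs (one vocabulary size and one mean low-resource accuracy per model) are an assumption about how such a comparison could be set up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near 1 would indicate that models with larger vocabularies tend to score higher, but on its own it says nothing about whether vocabulary size causes the difference.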

Critical Analysis

The Belebele dataset represents an important step forward in evaluating the multilingual capabilities of NLP systems. By expanding language coverage to 122 variants, it pushes the boundaries of existing benchmarks and enables more thorough testing.

However, the paper does acknowledge some potential limitations. For example, the dataset is focused on machine reading comprehension, which may not fully capture all aspects of language understanding. There is also a question of how representative the Flores-200 source passages are of real-world text.

Additionally, while the results provide valuable insights, the researchers note that further analysis is needed to fully understand the factors driving the performance differences between MLMs and LLMs on low-resource languages. The correlation with vocabulary size is an interesting observation, but more research is required to establish causality.

Future work could also explore how the Belebele dataset could be repurposed for other NLP tasks beyond multiple-choice comprehension, or how it could be combined with other multilingual benchmarks to provide an even more comprehensive evaluation.

Conclusion

The Belebele dataset represents a significant advancement in multilingual natural language understanding benchmarks. By expanding language coverage to 122 variants, it enables a more thorough evaluation of the multilingual capabilities of text models. The findings indicate that much smaller multilingual masked language models pretrained on balanced data understand far more languages than larger English-centric language models, and that larger vocabulary size and deliberate vocabulary construction correlate with better performance on low-resource languages.

This dataset opens up new avenues for analyzing and improving the multilingual performance of NLP systems, which is crucial for developing AI technologies that can truly understand and communicate in the diverse range of languages used around the world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.

7/26/2024

Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Teresa Lynn, Malik H. Altakrori, Samar Mohamed Magdy, Rocktim Jyoti Das, Chenyang Lyu, Mohamed Nasr, Younes Samih, Alham Fikri Aji, Preslav Nakov, Shantanu Godbole, Salim Roukos, Radu Florian, Nizar Habash

The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share our insights from the process, which we hope will contribute to a deeper understanding of the challenges and the opportunities associated with task reformulation in NLP research.

4/29/2024

Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension

Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

Performance of NLP systems is typically evaluated by collecting a large-scale dataset by means of crowd-sourcing to train a data-driven model and evaluate it on a held-out portion of the data. This approach has been shown to suffer from spurious correlations and the lack of challenging examples that represent the diversity of natural language. Instead, we examine a framework for evaluating optimised models in training-set free setting on synthetically generated challenge sets. We find that despite the simplicity of the generation method, the data can compete with crowd-sourced datasets with regard to naturalness and lexical diversity for the purpose of evaluating the linguistic capabilities of MRC models. We conduct further experiments and show that state-of-the-art language model-based MRC systems can learn to succeed on the challenge set correctly, although, without capturing the general notion of the evaluated phenomenon.

8/12/2024

NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

Anuoluwapo Aremu, Jesujoba O. Alabi, Daud Abolade, Nkechinyere F. Aguobi, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani

In this paper, we create NaijaRC: a new multi-choice Reading Comprehension dataset for three native Nigeria languages that is based on high-school reading comprehension examination. We provide baseline results by performing cross-lingual transfer using existing English RACE and Belebele training dataset based on a pre-trained encoder-only model. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.

5/21/2024