CreoleVal: Multilingual Multitask Benchmarks for Creoles

Read original: arXiv:2310.19567 - Published 5/7/2024 by Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau and 11 others

Overview

Creole languages are an underexplored and underrepresented group in natural language processing (NLP) research due to a lack of available resources.
Although Creoles share linguistic ties with well-resourced languages, the scarcity of annotated data hinders the potential for effective transfer learning.
The paper introduces CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks and covering up to 28 Creole languages.
The goal is to empower research on Creoles in NLP and computational linguistics, and to promote more equitable language technology worldwide.

Plain English Explanation

Creole languages are a distinct group of languages that developed from contact between other languages, often in the context of colonization. Unfortunately, these languages have been largely overlooked and underrepresented in natural language processing (NLP), the area of computer science that focuses on how machines can understand and process human language.

One of the key challenges is the lack of available data and resources for Creole languages. While Creoles are related to a number of well-studied languages, the scarcity of annotated data (data that has been labeled and organized for machine learning) makes it difficult to effectively apply techniques like transfer learning, where knowledge from one language model can be applied to another related language.

To address this gap, the researchers have created CreoleVal, a collection of benchmark datasets spanning 8 NLP tasks, including reading comprehension, relation classification, and machine translation, and covering up to 28 Creole languages. This resource is designed to enable more research on Creole languages within NLP and computational linguistics.

By making these datasets available, the researchers hope to ignite more interest and attention on Creole languages, which have been historically marginalized. Ultimately, this work is a step towards more equitable and inclusive language technology that can benefit people around the globe, regardless of their linguistic background.

Technical Explanation

The paper presents CreoleVal, a collection of benchmark datasets for 8 different NLP tasks covering up to 28 Creole languages. This resource is designed to address the lack of available data and resources for Creole languages, which have been largely overlooked in the NLP research community.

The researchers note that while Creoles share linguistic ties with a number of well-resourced languages, the scarcity of annotated data has hampered the potential for effective transfer learning. To address this, they have aggregated novel development datasets for tasks such as reading comprehension, relation classification, and machine translation, in addition to including a handful of preexisting benchmarks.

For each benchmark, the researchers conduct baseline experiments in a zero-shot setting, which means they evaluate the performance of models without any fine-tuning or additional training on the Creole language data. This allows them to assess the capabilities and limitations of transfer learning for Creole languages.
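The zero-shot protocol described above can be illustrated with a minimal sketch. It assumes a classification-style task and a frozen model exposed as a simple `predict` function. The names `evaluate_zero_shot`, `toy_predict`, and `toy_examples`, the record layout, and the placeholder language codes and texts are all hypothetical and are not taken from the paper or from any CreoleVal code. The point is only that the model is scored on the target-language examples without any parameter update.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: (language_code, input_text, gold_label).
Example = Tuple[str, str, str]


def evaluate_zero_shot(
    predict: Callable[[str], str],
    examples: Iterable[Example],
) -> Dict[str, float]:
    """Return per-language accuracy of a frozen model.

    The model is only queried through `predict`; nothing is trained or
    fine-tuned on the evaluation languages, which is what makes the
    setting zero-shot.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for lang, text, gold in examples:
        total[lang] += 1
        if predict(text) == gold:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def toy_predict(text: str) -> str:
    """Toy stand-in for a pretrained model (assumption, not a real system)."""
    return "positive" if "good" in text.lower() else "negative"


# Placeholder data: language codes and texts are illustrative only.
toy_examples = [
    ("lang_a", "good food", "positive"),
    ("lang_a", "bad road", "negative"),
    ("lang_a", "not good", "negative"),
    ("lang_b", "good day", "positive"),
    ("lang_b", "bad day", "negative"),
]

scores = evaluate_zero_shot(toy_predict, toy_examples)
for lang, acc in sorted(scores.items()):
    print(f"{lang}: accuracy = {acc:.2f}")
```

Reporting one score per language, rather than a single pooled number, makes it easier to see where transfer works and where it breaks down. A real evaluation would replace `toy_predict` with an actual pretrained model and use the task-appropriate metric for each benchmark.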

The CreoleVal dataset is presented as an opportunity to empower and expand research on Creoles in NLP and computational linguistics. By making these resources publicly available, the researchers hope to contribute towards more equitable and inclusive language technology around the world.

Critical Analysis

The paper highlights an important and often overlooked issue in the field of NLP: the lack of representation and resources for Creole languages. The creation of the CreoleVal dataset is a valuable contribution, as it provides a starting point for researchers to explore the potential of transfer learning and other techniques for Creole languages.

One potential limitation of the study is the scope of the dataset, which covers up to 28 Creole languages, with not every language represented in every task. While this is a significant improvement over the current state of research, there are many more Creole languages around the world that could benefit from similar attention and resources. Additionally, the paper does not delve deeply into the specific challenges and nuances of each Creole language, which could inform future research and dataset curation efforts.

Further, the baseline experiments reported for CreoleVal evaluate existing NLP models only in a zero-shot setting. While this is a valuable starting point, additional research is needed to explore other techniques, such as fine-tuning or multilingual modeling, to fully unlock the potential of transfer learning for Creole languages.

Overall, the CreoleVal dataset is a commendable step towards more inclusive and equitable language technology, in the same spirit as initiatives such as Portulan and GeNIL. However, continued efforts and further research will be necessary to truly empower underrepresented linguistic communities, as related studies such as CroissantLLM and "How Good Are Large Language Models on African Languages?" also suggest.

Conclusion

The paper introduces CreoleVal, a valuable collection of benchmark datasets for Creole languages, which have been historically underrepresented in NLP research. By making these resources available, the researchers aim to empower and expand research on Creoles, ultimately contributing towards more equitable and inclusive language technology around the globe.

While the CreoleVal dataset represents a significant step forward, continued efforts and further research will be necessary to fully unlock the potential of Creole languages and ensure that the benefits of language technology are accessible to all linguistic communities, regardless of their size or status.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CreoleVal: Multilingual Multitask Benchmarks for Creoles

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loic Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva

Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.

5/7/2024

Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loic Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Dean Stutzman, Bismarck Bamfo Odoom, Sanjeev Khudanpur, Stephen D. Richardson, Kenton Murray

A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.

5/14/2024

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Pontus Stenetorp

Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Aya-101 only at 58% of the best-performing proprietary model GPT-4o performance. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.

6/6/2024

One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support

Ronja Stern, Vishvaksenan Rasiah, Veton Matoshi, Srinanda Brugger Bose, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho, Joel Niklaus

Recent strides in Large Language Models (LLMs) have saturated many Natural Language Processing (NLP) benchmarks, emphasizing the need for more challenging ones to properly assess LLM capabilities. However, domain-specific and multilingual benchmarks are rare because they require in-depth expertise to develop. Still, most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. In this work, we introduce a novel NLP benchmark for the legal domain that challenges LLMs in five key dimensions: processing long documents (up to 50K tokens), using domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks) and reasoning (comprising especially Court View Generation, but also the Text Classification tasks). Our benchmark contains diverse datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual legal system. Despite the large size of our datasets (some with hundreds of thousands of examples), existing publicly available multilingual models struggle with most tasks, even after extensive in-domain pre-training and fine-tuning. We publish all resources (benchmark suite, pre-trained models, code) under permissive open CC BY-SA licenses.

8/22/2024