Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

2405.05376

Published 5/14/2024 by Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loic Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori and 7 others

cs.CL

Abstract

A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations -- 11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages -- the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity than ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 26 of 34 translation directions.

Overview

This paper focuses on building machine translation (MT) systems for Latin American, Caribbean, and Colonial African Creole languages, which have been historically underrepresented in natural language processing research.
The authors present Kreyòl-MT: the largest cumulative dataset to date for Creole MT (14.5M unique Creole sentences with parallel translations, 11.6M of them released publicly), together with MT models supporting 41 Creole languages in 172 translation directions.
The resulting model, exposed to more genre diversity than earlier Creole MT systems, outperforms a genre-specific Creole MT model on that model's own benchmark in 26 of 34 translation directions.

Plain English Explanation

The paper discusses the challenge of developing high-quality machine translation (MT) systems for Creole languages, which are languages that have evolved from a mixture of other languages, often as a result of colonization. These Creole languages, found in Latin America, the Caribbean, and parts of Africa, have traditionally been underrepresented in natural language processing research, making it difficult to build reliable MT systems for them.

To address this gap, the authors assemble "Kreyòl-MT," a large collection of parallel text covering 41 Creole languages, and use it to train MT models that translate in 172 directions. For 21 of those languages, this is the first parallel dataset ever gathered. Because the data come from many genres rather than a single one, the resulting model sees more varied text than earlier Creole MT systems did.

The researchers also compare their model against an earlier, genre-specific Creole MT model on that model's own benchmark, where the new model does better in 26 of 34 translation directions.

Technical Explanation

The paper presents the Kreyòl-MT framework, which aims to build high-quality machine translation (MT) systems for Latin American, Caribbean, and Colonial African Creole languages. These languages have been historically underrepresented in natural language processing research, making it challenging to develop reliable MT models for them.

Kreyòl-MT addresses the low-resource nature of Creole languages primarily through data: 14.5M unique Creole sentences with parallel translations, 11.6M of which are released publicly, including the largest bitexts gathered to date for 41 languages and the first ever for 21. The accompanying models are multilingual, supporting all 41 Creole languages in 172 translation directions. A multilingual model of this kind shares parameters across languages, which lets related languages benefit from each other's data.
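
One common way to let a single model share parameters across many translation directions is to prefix each source sentence with a tag naming the desired output language. The sketch below illustrates that general technique only; this summary does not say whether Kreyòl-MT uses it, and the `<2xx>` tag format, the record layout, and the function names are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical bitext record: (source_lang, target_lang, source_text, target_text).
Bitext = Tuple[str, str, str, str]


def tag_source(text: str, target_lang: str) -> str:
    """Prefix the source sentence with a token naming the desired output language."""
    return f"<2{target_lang}> {text.strip()}"


def build_examples(bitexts: Iterable[Bitext],
                   both_directions: bool = True) -> List[Tuple[str, str]]:
    """Turn mixed-language bitext into (tagged_source, target) training pairs."""
    examples: List[Tuple[str, str]] = []
    for src_lang, tgt_lang, src, tgt in bitexts:
        examples.append((tag_source(src, tgt_lang), tgt.strip()))
        if both_directions:
            # Reuse the same sentence pair for the reverse direction.
            examples.append((tag_source(tgt, src_lang), src.strip()))
    return examples


if __name__ == "__main__":
    # Illustrative pairs only (ISO 639-3 codes: hat = Haitian Creole, pap = Papiamento).
    data = [
        ("hat", "eng", "Bonjou", "Hello"),
        ("pap", "eng", "Bon dia", "Good morning"),
    ]
    for source, target in build_examples(data):
        print(source, "->", target)
```

Because every example carries its own target tag, pairs from all languages can be mixed in one training set, which is what allows data from related languages to be shared.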

For evaluation, the authors compare against an existing genre-specific Creole MT model on that model's own benchmark. The Kreyòl-MT model outperforms it in 26 of 34 translation directions.

The authors attribute this to the diversity of their dataset: the model is exposed to more genre diversity than any previous Creole MT model, rather than being trained on text from a single genre.
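
Evaluating across many language pairs usually comes down to a per-direction comparison: score each system on each translation direction and count where one beats the other, which is how a result such as the abstract's "26 of 34 translation directions" is tallied. The following is a minimal sketch with made-up scores and hypothetical function names; it does not reproduce the paper's metric or numbers.

```python
from typing import Dict, List, Tuple

Direction = Tuple[str, str]  # (source_lang, target_lang)


def count_wins(ours: Dict[Direction, float],
               baseline: Dict[Direction, float]) -> Tuple[int, int, List[Direction]]:
    """Compare two systems on the directions both were scored on.

    Returns (wins, total, directions where `ours` scores strictly higher).
    Assumes a higher score is better, as with BLEU or chrF.
    """
    shared = sorted(set(ours) & set(baseline))
    won = [d for d in shared if ours[d] > baseline[d]]
    return len(won), len(shared), won


if __name__ == "__main__":
    # Made-up scores for illustration only.
    ours = {("hat", "eng"): 31.0, ("eng", "hat"): 24.5, ("pap", "eng"): 40.2}
    baseline = {("hat", "eng"): 28.4, ("eng", "hat"): 25.1, ("pap", "eng"): 33.0}
    wins, total, won = count_wins(ours, baseline)
    print(f"better in {wins} of {total} directions: {won}")
```

Only directions scored for both systems are counted, so a model that supports more directions than the baseline is compared solely on the overlap.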

Critical Analysis

The Kreyòl-MT dataset and models presented in this paper are valuable contributions to the field of natural language processing, particularly for addressing the underrepresentation of Creole languages in MT research.

One potential limitation of the study is the availability and quality of the dataset used for training and evaluating the MT models. The authors acknowledge that the Creole language data is often scarce and of varying quality, which could impact the performance of the developed systems.

Additionally, the paper does not explore the potential biases or sociocultural implications of deploying MT systems for Creole languages, which could be an important consideration, given the historical context of colonization and language suppression in many of these regions.

Further research could investigate techniques for improving the robustness and generalization of MT models for Creole languages, as well as explore the ethical and societal implications of developing these systems.

Conclusion

This paper presents a significant step forward in machine translation for historically underrepresented Creole languages. The Kreyòl-MT dataset and models provide a foundation for developing high-quality MT systems for these languages, which could improve communication, education, and access to information for their speakers.

The insights gained from this research can also inform the development of more effective MT systems for other low-resource languages, potentially contributing to a more inclusive and equitable natural language processing landscape.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CreoleVal: Multilingual Multitask Benchmarks for Creoles

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loic Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva

Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.

5/7/2024

cs.CL cs.AI

How good are Large Language Models on African Languages?

Jessica Ojo, Kelechi Ogueji, Pontus Stenetorp, David Ifeoluwa Adelani

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 has an average to good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find the recent Aya model to have comparable results to mT0 in almost all tasks except for topic classification, where it outperforms mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English and code-centric (around 98%) pre-training corpus. Our findings confirm that performance on African languages continues to remain a hurdle for the current LLMs, underscoring the need for additional efforts to close this gap.

5/1/2024

cs.CL cs.AI cs.LG

Machine Translation for Ge'ez Language

Aman Kassahun Wassie

Machine translation (MT) for low-resource languages such as Ge'ez, an ancient language that is no longer the native language of any community, faces challenges such as out-of-vocabulary words, domain mismatches, and lack of sufficient labeled training data. In this work, we explore various methods to improve Ge'ez MT, including transfer-learning from related languages, optimizing shared vocabulary and token segmentation approaches, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation with fuzzy matches. We develop a multilingual neural machine translation (MNMT) model based on language relatedness, which brings an average performance improvement of about 4 BLEU compared to standard bilingual models. We also attempt to finetune the NLLB-200 model, one of the most advanced translation models available today, but find that it performs poorly with only 4k training samples for Ge'ez. Furthermore, we experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches, which leverages embedding similarity-based retrieval to find context examples from a parallel corpus. We observe that GPT-3.5 achieves a remarkable BLEU score of 9.2 with no initial knowledge of Ge'ez, but this is still lower than the MNMT baseline of 15.2. Our work provides insights into the potential and limitations of different approaches for low-resource and ancient language MT.

4/16/2024

cs.CL

Feriji: A French-Zarma Parallel Corpus, Glossary & Translator

Mamadou K. Keita, Elysabhete Amadou Ibrahim, Habibatou Abdoulaye Alfari, Christopher Homan

Machine translation (MT) is a rapidly expanding field that has experienced significant advancements in recent years with the development of models capable of translating multiple languages with remarkable accuracy. However, the representation of African languages in this field still needs to improve due to linguistic complexities and limited resources. This applies to the Zarma language, a dialect of Songhay (of the Nilo-Saharan language family) spoken by over 5 million people across Niger and neighboring countries (Lewis et al., 2016). This paper introduces Feriji, the first robust French-Zarma parallel corpus and glossary designed for MT. The corpus, containing 61,085 sentences in Zarma and 42,789 in French, and a glossary of 4,062 words represent a significant step in addressing the need for more resources for Zarma. We fine-tune three large language models on our dataset, obtaining a BLEU score of 30.06 on the best-performing model. We further evaluate the models on human judgments of fluency, comprehension, and readability and the importance and impact of the corpus and models. Our contributions help to bridge a significant language gap and promote an essential and overlooked indigenous African language.

6/19/2024

cs.CL