How good are Large Language Models on African Languages?

arXiv:2311.07978

Published 5/1/2024 by Jessica Ojo, Kelechi Ogueji, Pontus Stenetorp, David Ifeoluwa Adelani

Abstract

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 has average-to-good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find the recent Aya model to have comparable results to mT0 in almost all tasks except for topic classification, where it outperforms mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English and code-centric (around 98%) pre-training corpus. Our findings confirm that performance on African languages continues to remain a hurdle for the current LLMs, underscoring the need for additional efforts to close this gap.

Overview

The paper examines the performance of four large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks across 60 African languages.
The results show that the language models generally perform worse on African languages compared to high-resource languages like English.
The paper highlights the need for more efforts to improve the performance of language models on African languages.

Plain English Explanation

Large language models (LLMs) like GPT-4 and LLaMa 2 have become very capable at understanding and generating human-like text. These models are trained on vast amounts of data from the internet and can perform well on a variety of tasks, even for languages they weren't specifically trained on.

However, the researchers found that the performance of these LLMs is significantly lower for African languages compared to high-resource languages like English. They tested four popular LLMs on tasks like topic classification, sentiment analysis, machine translation, summarization, question answering, and named entity recognition across 60 African languages.

The results suggest that while models like GPT-4 can do reasonably well on some classification tasks for African languages, they struggle with generative tasks like translation and summarization. The researchers also found that the mT0 model performed the best overall on the cross-lingual question answering task, outperforming both GPT-4 and the state-of-the-art supervised model (a fine-tuned mT5).

The paper highlights the need for more research and development to improve the performance of LLMs on African languages. This is important because these models are increasingly being used in a wide range of applications, and their poor performance on underrepresented languages could lead to biases and exclusion.

Technical Explanation

The researchers in this paper conducted an extensive evaluation of four popular large language models - mT0, Aya, LLaMa 2, and GPT-4 - on six different tasks across 60 African languages.

The tasks included topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition. The researchers chose a diverse set of African languages spanning different language families and geographical regions to get a comprehensive understanding of the models' performance.
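To make the classification setup concrete, here is a minimal sketch of how a zero-shot topic-classification prompt might be built and how a model's free-text answer could be mapped back to a label. The label set, the prompt template, and the helper names (`build_prompt`, `parse_label`, `accuracy`) are assumptions made for illustration; the paper's actual prompts and scoring scripts may differ.

```python
from typing import Optional, Sequence

# Hypothetical label set; not taken from the paper.
TOPIC_LABELS = ["business", "health", "politics", "sports", "technology"]


def build_prompt(text: str, labels: Sequence[str] = TOPIC_LABELS) -> str:
    """Build a zero-shot topic-classification prompt for one input text."""
    options = ", ".join(labels)
    return (
        f"Classify the topic of the following text as one of: {options}.\n"
        f"Text: {text}\n"
        "Topic:"
    )


def parse_label(output: str, labels: Sequence[str] = TOPIC_LABELS) -> Optional[str]:
    """Return the label mentioned earliest in the model output, or None."""
    lowered = output.lower()
    # Collect (position, label) pairs for every label that appears in the output.
    hits = [(lowered.find(label), label) for label in labels if label in lowered]
    return min(hits)[1] if hits else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Treating an unparseable answer as `None` (and therefore as wrong) is one possible design choice; it penalizes models that ignore the requested output format, which tends to matter more for low-resource languages.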

The results showed that all the LLMs performed significantly worse on the African languages compared to high-resource languages like English. There was a large gap in performance, particularly for the generative tasks like machine translation and summarization.
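The gap described here can be expressed as the difference between a reference language's score and each other language's score on the same task and metric. The sketch below shows one way to compute it; the function names and the language keys are hypothetical, and the numbers in the demo are placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict


def gap_to_reference(scores: Dict[str, float], reference: str = "eng") -> Dict[str, float]:
    """Return (reference score - language score) for every non-reference language."""
    if reference not in scores:
        raise KeyError(f"reference language {reference!r} missing from scores")
    ref_score = scores[reference]
    return {lang: ref_score - s for lang, s in scores.items() if lang != reference}


def mean_gap(scores: Dict[str, float], reference: str = "eng") -> float:
    """Average gap to the reference language across all other languages."""
    gaps = gap_to_reference(scores, reference)
    if not gaps:
        raise ValueError("need at least one non-reference language")
    return mean(gaps.values())


if __name__ == "__main__":
    # Placeholder scores for illustration only; NOT taken from the paper.
    demo_scores = {"eng": 90.0, "lang_a": 70.0, "lang_b": 60.0}
    print(gap_to_reference(demo_scores))
    print(mean_gap(demo_scores))
```

A positive gap means the model scores lower on that language than on the reference. The scores compared should come from the same task and metric, since gaps are not comparable across different metrics.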

Interestingly, the researchers found that the mT0 model performed the best overall on the cross-lingual question answering task, outperforming both the state-of-the-art supervised model (fine-tuned mT5) and GPT-4. The Aya model showed comparable results to mT0 on almost all tasks, except for topic classification, where it outperformed mT0.

On the other hand, LLaMa 2 exhibited the worst performance overall, which the researchers attribute to its pre-training corpus being heavily English- and code-centric (around 98%).

Critical Analysis

The paper provides valuable insights into the current limitations of large language models when it comes to processing African languages. The researchers have conducted a comprehensive evaluation across a wide range of tasks and languages, which gives us a clear picture of the performance gaps.

One potential concern is the lack of a detailed analysis of the factors contributing to the poor performance of the models on African languages. The paper mentions that the pre-training corpus composition may play a role, but a more in-depth investigation into the specific challenges, such as linguistic diversity, data availability, or model architecture limitations, could have provided a more complete understanding.

Additionally, the paper does not discuss the potential societal implications of these performance gaps. As LLMs become more widely deployed in various applications, their poor performance on underrepresented languages could lead to biases and exclusion, which should be further explored and addressed.

The researchers also acknowledge the need for additional efforts to close the performance gap, but they do not provide specific recommendations or a roadmap for how this can be achieved. Exploring potential solutions, such as enhancing existing models or developing specialized models for African languages, could have strengthened the paper's impact and practical relevance.

Conclusion

This paper highlights a significant challenge in the field of natural language processing: the disparity between how large language models perform on African languages and how they perform on high-resource languages. The researchers have conducted a comprehensive evaluation across a wide range of tasks and models, revealing that current LLMs fall well short of their high-resource language performance when applied to African languages.

The findings underscore the need for more concerted efforts to improve the capabilities of language models on underrepresented languages. As these models become increasingly prominent in various applications, ensuring equitable and inclusive performance is crucial. The paper serves as a call to action for the research community to address this pressing issue and work towards closing the performance gap for African languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Xuanfan Ni, Piji Li

Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic results, accompanied by a detailed analysis.

5/17/2024

cs.CL

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.

4/4/2024

cs.CL

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Jakub Hoscilowicz, Pawel Pawlowski, Marcin Skorupa, Marcin Sowański, Artur Janicki

Spoken Language Understanding (SLU) models are a core component of voice assistants (VA), such as Alexa, Bixby, and Google Assistant. In this paper, we introduce a pipeline designed to extend SLU systems to new languages, utilizing Large Language Models (LLMs) that we fine-tune for machine translation of slot-annotated SLU training data. Our approach improved on the MultiATIS++ benchmark, a primary multi-language SLU dataset, in the cloud scenario using an mBERT model. Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%, compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). In the on-device scenario (tiny and not pretrained SLU), our method improved the Overall Accuracy from 5.31% to 22.06% over the baseline Global-Local Contrastive Learning Framework (GL-CLeF) method. Contrary to both FC-MTLF and GL-CLeF, our LLM-based machine translation does not require changes in the production architecture of SLU. Additionally, our pipeline is slot-type independent: it does not require any slot definitions or examples.

4/4/2024

cs.CL

SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

4/10/2024

cs.CL cs.AI cs.LG