InkubaLM: A small language model for low-resource African languages

Read original: arXiv:2408.17024 - Published 9/4/2024 by Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Anuoluwapo Aremu, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate and 1 other

InkubaLM: A small language model for low-resource African languages

Overview

Introduces InkubaLM, a small language model for low-resource African languages
Focuses on improving language modeling for under-represented languages
Designed to be efficient and accessible for real-world deployment

Plain English Explanation

InkubaLM: A small language model for low-resource African languages is a research paper that describes the development of a compact language model aimed at supporting less commonly used African languages. The goal is to create an efficient and accessible solution for improving natural language processing capabilities in these under-resourced linguistic contexts.

Many African languages have limited digital content and data available, making it challenging to build robust language models. InkubaLM addresses this problem by designing a smaller, more lightweight model that can be deployed more easily. The researchers leveraged techniques like parameter sharing and distillation to create a high-performing model with a reduced footprint.

By focusing on low-resource African languages, this work seeks to make natural language technologies more inclusive and accessible to diverse global populations. The researchers hope that InkubaLM can serve as a foundation for future advancements in language AI that better support marginalized communities.

Technical Explanation

The paper describes the development of InkubaLM, a compact language model designed for low-resource African languages. To address the challenge of limited data availability, the researchers employed techniques like parameter sharing and knowledge distillation to create an efficient model architecture.

The team evaluated InkubaLM on a diverse set of African languages, including Swahili, Yoruba, and Amharic. They compared the model's performance to larger, more complex language models as well as other compact alternatives. The results showed that InkubaLM achieved strong performance while maintaining a significantly smaller footprint, making it well-suited for real-world deployment in resource-constrained environments.

Key insights from the technical analysis include the effectiveness of distillation in reducing model size without compromising accuracy, as well as the value of pretraining on multilingual corpora to leverage cross-lingual knowledge. The researchers also discuss the potential of InkubaLM to serve as a foundation for further advancements in African language AI.

Critical Analysis

The InkubaLM paper presents a promising approach to improving language modeling for low-resource African languages. By prioritizing efficiency and accessibility, the researchers have developed a model that could have significant real-world impact.

However, the paper acknowledges several limitations that warrant further investigation. The evaluation was conducted on a relatively small set of languages, so the broader applicability of InkubaLM across the diverse African linguistic landscape remains to be seen. Additionally, the researchers note that the model's performance could be further improved by incorporating more specialized data sources and fine-tuning techniques.

Future research could also explore ways to make the model-building process even more accessible, such as through automated or user-friendly tools. Addressing potential biases or fairness issues in the model's outputs would also be an important area for additional study.

Overall, the InkubaLM work represents a valuable contribution to the field of language AI, with the potential to enable more inclusive and equitable natural language technologies for underserved communities.

Conclusion

InkubaLM is a compact language model designed to improve natural language processing capabilities for low-resource African languages. By employing techniques like parameter sharing and knowledge distillation, the researchers have created an efficient model that can be more readily deployed in real-world settings.

The paper's focus on supporting marginalized linguistic communities is a notable strength, as it aligns with broader efforts to make language AI more inclusive and accessible globally. While the current evaluation is limited in scope, the general approach and insights presented in the work could have significant implications for the future of language technology development.

Overall, the InkubaLM research represents an important step towards bridging the gap in language AI support for under-represented languages and populations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

InkubaLM: A small language model for low-resource African languages

Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Anuoluwapo Aremu, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available at https://huggingface.co/lelapa to encourage research and development on low-resource languages.

9/4/2024

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Pontus Stenetorp

Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference~(AfriXNLI), mathematical reasoning~(AfriMGSM), and multi-choice knowledge-based QA~(AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings~(where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages~(such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Aya-101 only at 58% of the best-performing proprietary model GPT-4o performance. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.

6/6/2024

💬

EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation

Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, Dietrich Klakow, Shengwu Xiong, Seid Muhie Yimam

Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM -- multilingual large language models for five Ethiopian languages (Amharic, Ge'ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark -- a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, new benchmark datasets for various downstream tasks, and task-specific fine-tuned language models and discuss the performance of the models. Our dataset and models are available at the https://huggingface.co/EthioNLP repository.

6/26/2024

💬

How good are Large Language Models on African Languages?

Jessica Ojo, Kelechi Ogueji, Pontus Stenetorp, David Ifeoluwa Adelani

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 has an average to good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find the recent Aya model to have comparable result to mT0 in almost all tasks except for topic classification where it outperform mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English and code-centric~(around 98%) pre-training corpus. Our findings confirms that performance on African languages continues to remain a hurdle for the current LLMs, underscoring the need for additional efforts to close this gap.

5/1/2024