TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese

arXiv: 2401.16640

Published 4/10/2024 by Nicholas Kluge Corrêa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira

Abstract

Large language models (LLMs) have significantly advanced natural language processing, but their progress has yet to be equal across languages. While most LLMs are trained in high-resource languages like English, multilingual models generally underperform monolingual ones. Additionally, aspects of their multilingual foundation sometimes restrict the byproducts they produce, like computational demands and licensing regimes. In this study, we document the development of open-foundation models tailored for use in low-resource settings, their limitations, and their benefits. This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development. See https://github.com/Nkluge-correa/TeenyTinyLlama

Overview

This paper introduces TeenyTinyLlama, a set of open-source tiny language models trained in Brazilian Portuguese.
The models are designed to be lightweight and efficient, making them suitable for deployment on resource-constrained devices or in low-bandwidth settings.
The researchers explore techniques for training compact language models while maintaining strong performance on a variety of natural language processing tasks.

Plain English Explanation

The researchers have developed a new set of tiny language models called TeenyTinyLlama, which are trained specifically on the Brazilian Portuguese language. These models are designed to be much smaller and more efficient than larger language models, while still being able to perform well on different natural language tasks.

The motivation behind this work is to create language models that can be easily deployed on devices with limited computational resources, such as smartphones or internet-connected sensors. Larger language models can be powerful, but they often require a lot of memory and processing power, which can be a challenge in certain applications.

By training these compact models on Brazilian Portuguese data, the researchers aim to make high-quality natural language processing capabilities more accessible to users and developers in regions where Portuguese is the primary language. This could have applications in areas like virtual assistants, translation services, and content generation.

The researchers explore various techniques for training these tiny models, with the goal of maintaining strong performance while significantly reducing the model size and computational requirements. This involves finding the right balance between model complexity and task-specific accuracy.

Technical Explanation

The researchers propose the TeenyTinyLlama models, which are small-scale language models trained on a corpus of Brazilian Portuguese text. The models are based on the Llama architecture, a decoder-only transformer design that has shown strong performance on a variety of natural language tasks.

To train the TeenyTinyLlama models, the researchers used a curated dataset of Portuguese web pages, books, and other text sources. As the abstract notes, the release is a pair of compact models (160 million and 460 million parameters), sized to balance model complexity against task performance.
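To make the link between model size and hardware requirements concrete, here is a rough back-of-the-envelope sketch. It only multiplies a parameter count by the bytes needed per parameter, so it covers the weights alone and ignores activations, caches, and runtime overhead. The function name, the dtype table, and the parameter counts in the loop are illustrative assumptions, not figures from the paper.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# All names and numbers here are illustrative assumptions, not from the paper.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mib(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the model weights in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical sizes spanning a tiny model up to a multi-billion one.
    for n in (10_000_000, 100_000_000, 7_000_000_000):
        size = weight_memory_mib(n, "float16")
        print(f"{n:,} params -> {size:,.1f} MiB in float16")
```

The estimate scales linearly, which is why shrinking the parameter count (or the numeric precision) is the most direct lever for fitting a model onto a constrained device.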

The training process involved standard language modeling objectives, such as predicting the next word in a sequence, as well as fine-tuning on specific downstream tasks like text classification, named entity recognition, and question answering. The researchers also explored techniques like knowledge distillation, where a larger, more capable model is used to guide the training of the smaller TeenyTinyLlama models.
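The next-word objective mentioned above can be sketched in a few lines. The example below is a minimal, dependency-free illustration of the standard next-token cross-entropy loss and the perplexity derived from it. It is not the authors' training code: the function names and the toy logits are assumptions made for illustration, and a real training run would compute this with a deep learning framework over large batches.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds one score per vocabulary entry, produced
    after the model has read token_ids[: t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]  # the token that actually comes next
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


def perplexity(loss):
    """Perplexity is the exponential of the average cross-entropy."""
    return math.exp(loss)


if __name__ == "__main__":
    # Toy example: vocabulary of 4 tokens, a model that is completely unsure.
    uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
    loss = next_token_loss(uniform_logits, [0, 1, 2])
    print(loss, perplexity(loss))
```

A model that spreads probability evenly over the vocabulary has a perplexity equal to the vocabulary size; training pushes this number down by concentrating probability on the tokens that actually follow.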

Through their experiments, the researchers demonstrated that the TeenyTinyLlama models are able to achieve strong results on a range of Portuguese language tasks, while being significantly more compact and efficient than larger, more complex models. This makes them well-suited for deployment on resource-constrained devices or in low-bandwidth environments.

Critical Analysis

The researchers have made a compelling case for the need for efficient, language-specific models like TeenyTinyLlama, especially in the context of under-resourced languages like Brazilian Portuguese. By focusing on developing compact models, they are addressing an important gap in the availability of high-quality natural language processing tools for these languages.

One potential limitation of the study is the reliance on a single dataset for pre-training the models. While the researchers have curated a diverse corpus of Portuguese text, there may be room for improvement by incorporating additional data sources or exploring techniques like cross-lingual transfer learning to further enhance the models' performance.

Additionally, the researchers could have delved deeper into the model interpretability and fairness aspects, as these are crucial considerations when deploying language models in real-world applications. Aspects like bias mitigation and multilingual capabilities could be explored in future work.

Overall, the TeenyTinyLlama models represent an important step towards making natural language processing more accessible and inclusive, particularly for under-resourced languages. The researchers have demonstrated the feasibility of training compact, high-performing models, which could pave the way for further innovations in efficient multimodal language models and their widespread deployment.

Conclusion

The TeenyTinyLlama models introduced in this paper represent a promising approach to developing efficient, language-specific natural language processing capabilities, with a focus on the Brazilian Portuguese language. By creating compact models that can maintain strong performance on a variety of tasks, the researchers have addressed an important gap in the availability of high-quality NLP tools for under-resourced languages.

The ability to deploy these models on resource-constrained devices or in low-bandwidth settings opens up new possibilities for applications like virtual assistants, translation services, and content generation tailored to Portuguese-speaking users. As the field of language model development continues to evolve, the techniques and insights presented in this paper could inform future work on efficient, multilingual, and multimodal language models that are accessible to a diverse global audience.

Related Papers

Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer

Hele-Andra Kuulmets, Taido Purason, Agnes Luhtaru, Mark Fishel

This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.

4/8/2024

cs.CL

SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

4/10/2024

cs.CL cs.AI cs.LG

Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages

David Ifeoluwa Adelani, A. Seza Doğruöz, André Coneglian, Atul Kr. Ojha

Large Language Models are transforming NLP for a variety of tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the goals of the AmericasNLP workshop, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse for the part of speech (POS) labeling of LRLs in comparison to HRLs. We explain the reasons behind this failure and provide an error analysis through examples observed in our data set.

5/1/2024

cs.CL

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Jakub Hoscilowicz, Pawel Pawlowski, Marcin Skorupa, Marcin Sowański, Artur Janicki

Spoken Language Understanding (SLU) models are a core component of voice assistants (VA), such as Alexa, Bixby, and Google Assistant. In this paper, we introduce a pipeline designed to extend SLU systems to new languages, utilizing Large Language Models (LLMs) that we fine-tune for machine translation of slot-annotated SLU training data. Our approach improved on the MultiATIS++ benchmark, a primary multi-language SLU dataset, in the cloud scenario using an mBERT model. Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%, compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). In the on-device scenario (tiny and not pretrained SLU), our method improved the Overall Accuracy from 5.31% to 22.06% over the baseline Global-Local Contrastive Learning Framework (GL-CLeF) method. Contrary to both FC-MTLF and GL-CLeF, our LLM-based machine translation does not require changes in the production architecture of SLU. Additionally, our pipeline is slot-type independent: it does not require any slot definitions or examples.

4/4/2024

cs.CL