Open Generative Large Language Models for Galician

2406.13893

Published 6/21/2024 by Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia
Abstract

Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.

Overview

  • This paper presents open-source generative large language models (LLMs) for Galician, a minoritized language spoken in northwestern Spain.
  • The researchers aim to address the lack of LLMs and other language technologies for low-resource languages like Galician.
  • They describe how they trained and evaluated two Galician LLMs, and discuss the models' performance and potential applications.

Plain English Explanation

Generative large language models (LLMs) are AI systems that can generate human-like text on a wide range of topics. They are trained on vast amounts of text, which lets them understand and produce language in sophisticated ways. However, their training has been predominantly English-centric, which leads to biases and weaker performance in other languages.

This paper explores the creation of open-source LLMs for the Galician language, which is spoken by a relatively small number of people in northwestern Spain. Galician is considered a "low-resource" language, meaning it has limited digital content and language technologies available compared to more widely used languages.

The researchers trained two Galician LLMs by taking existing models trained on larger corpora and continuing their pretraining on Galician text. They then evaluated the models using human judgments and task-based datasets from standardized benchmarks. The results suggest these Galician LLMs can be useful for applications like [internal link: https://aimodels.fyi/papers/arxiv/gentranslate-large-language-models-are-generative-multilingual] machine translation, [internal link: https://aimodels.fyi/papers/arxiv/llamaturk-adapting-open-source-generative-large-language] content creation, and other language-related services.

By making these Galician LLMs open-source, the researchers hope to spur further development and use of language technology for this minority language, ultimately empowering Galician speakers and preserving their linguistic and cultural heritage.

Technical Explanation

The researchers first reviewed the existing work on LLMs for Iberian languages like Spanish and Portuguese, noting the lack of similar models for Galician. They then described their methodology for training two Galician LLMs with a GPT architecture and 1.3B parameters. Rather than training from scratch, they used continual pretraining: two existing LLMs, already [internal link: https://aimodels.fyi/papers/arxiv/investigating-translation-capabilities-large-language-models-trained] pretrained on larger text corpora, were adapted to Galician using a corpus of 2.1B words, which mitigates the data constraints of training from scratch.
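The abstract describes the adaptation step as continual pretraining: start from a model that has already been trained on a larger corpus and keep training it on Galician text. The sketch below illustrates only that mechanic with a toy character-bigram model in plain Python. It is not the authors' training setup; the corpora, the `train` and `perplexity` helpers, and the add-one smoothing are all assumptions chosen to keep the example small and self-contained.

```python
import math
from collections import Counter

# Toy stand-ins (assumptions): a larger English "base" corpus
# and a much smaller Galician corpus.
BASE_TEXT = (
    "large language models have transformed natural language processing "
    "yet their training is predominantly english centric "
) * 10
GALICIAN_TEXT = (
    "o galego é unha lingua falada en galicia "
    "a lingua galega ten unha longa tradición literaria "
) * 3
HELD_OUT = "unha lingua galega falada en galicia"

VOCAB = set(BASE_TEXT + GALICIAN_TEXT + HELD_OUT)


def train(counts, text):
    """Add the character bigrams of `text` to `counts` and return it."""
    counts.update(zip(text, text[1:]))
    return counts


def perplexity(counts, text):
    """Per-bigram perplexity of `text` under an add-one-smoothed bigram model."""
    context_totals = Counter()
    for (first, _), n in counts.items():
        context_totals[first] += n
    pairs = list(zip(text, text[1:]))
    log_prob = sum(
        math.log((counts[pair] + 1) / (context_totals[pair[0]] + len(VOCAB)))
        for pair in pairs
    )
    return math.exp(-log_prob / len(pairs))


# "Pretraining": fit the base model on the larger corpus.
base = train(Counter(), BASE_TEXT)

# "Continual pretraining": start from a copy of the base statistics
# and keep training on the Galician text.
adapted = train(Counter(base), GALICIAN_TEXT)

ppl_base = perplexity(base, HELD_OUT)
ppl_adapted = perplexity(adapted, HELD_OUT)
print(f"held-out Galician perplexity, base model:    {ppl_base:.2f}")
print(f"held-out Galician perplexity, adapted model: {ppl_adapted:.2f}")
```

In this toy setting the adapted model assigns lower perplexity to the held-out Galician sentence than the base model does, because the Galician bigrams it needs were added on top of the existing statistics. A real continual-pretraining run would update the weights of a neural model rather than bigram counts, and the example says nothing about how adaptation compares with training from scratch.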

The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations showed promising performance for the Galician LLMs. The researchers also explored the models' ability to handle code-switching between Galician and Spanish, a common occurrence in real-world Galician language use.

To ensure the sustainability and accessibility of these Galician LLMs, the researchers made the models and training code available as open-source resources. This aligns with their goal of [internal link: https://aimodels.fyi/papers/arxiv/llamaturk-adapting-open-source-generative-large-language] promoting the development of language technologies for low-resource languages.

Critical Analysis

The paper is a promising step towards addressing the lack of language technology for Galician, but a few limitations and areas for further research are worth noting. The training corpus of 2.1B words, while substantial, is still small compared to the data used for LLMs in major languages. This could limit the models' performance, especially in more specialized or technical domains.
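To put the data-size concern in numbers, the two figures given in the abstract can be combined into a rough words-per-parameter ratio. This is a back-of-the-envelope sketch; the ratio itself is an illustrative measure introduced here, not a figure reported in the paper.

```python
# Figures taken from the abstract.
PARAMETERS = 1.3e9    # model size: 1.3B parameters
CORPUS_WORDS = 2.1e9  # Galician corpus size: 2.1B words

# Illustrative ratio (an assumption of this sketch, not a metric from the paper).
words_per_parameter = CORPUS_WORDS / PARAMETERS
print(f"{words_per_parameter:.2f} Galician training words per model parameter")
```

The corpus provides fewer than two Galician words per parameter, which helps explain why the authors adapt existing models instead of training from scratch.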

Additionally, the researchers acknowledge the need for further evaluation of the models' performance in real-world applications, such as [internal link: https://aimodels.fyi/papers/arxiv/teenytinyllama-open-source-tiny-language-models-trained] machine translation or content generation. Deployment and user feedback would provide valuable insights to refine the models and their usefulness for Galician speakers.

Future research could also explore ways to [internal link: https://aimodels.fyi/papers/arxiv/sambalingo-teaching-large-language-models-new-languages] efficiently adapt and scale these Galician LLMs to other low-resource languages, leveraging the insights and techniques developed in this work.

Conclusion

This paper demonstrates the feasibility and potential benefits of developing open-source generative LLMs for low-resource languages like Galician. By making these models publicly available, the researchers hope to spark further innovation and better language technologies for Galician and other minority languages, ultimately empowering their speakers and preserving linguistic diversity.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Investigating the translation capabilities of Large Language Models trained on parallel data only

Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

6/14/2024

LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Cagri Toraman

Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.

5/14/2024

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely GenTranslate, which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

5/17/2024

TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese

Nicholas Kluge Corrêa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira

Large language models (LLMs) have significantly advanced natural language processing, but their progress has yet to be equal across languages. While most LLMs are trained in high-resource languages like English, multilingual models generally underperform monolingual ones. Additionally, aspects of their multilingual foundation sometimes restrict the byproducts they produce, like computational demands and licensing regimes. In this study, we document the development of open-foundation models tailored for use in low-resource settings, their limitations, and their benefits. This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development. See https://github.com/Nkluge-correa/TeenyTinyLlama

5/20/2024