Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language

Read original: arXiv:2405.05777 - Published 5/10/2024 by Ronny Paul, Himanshu Buckchash, Shantipriya Parida, Dilip K. Prasad

Overview

  • This paper explores the progress and perspectives in training large language models (LLMs) for Sámi, a group of indigenous minority languages spoken by the Sámi people across northern Norway, Sweden, Finland, and Russia's Kola Peninsula.
  • The study focuses on Northern Sámi, which is the most widely used variety among native Sámi speakers.
  • The researchers aim to make AI more inclusive by developing language models that can support lesser-known languages like Sámi.

Plain English Explanation

The paper looks at progress in training large language models, AI systems that can understand and generate human-like text, for Sámi. Sámi is a group of minority languages spoken by the indigenous Sámi people of northern Norway, Sweden, Finland, and Russia's Kola Peninsula.

The researchers focused on Northern Sámi, as it is the most commonly used variety among native Sámi speakers. The goal is to make AI more inclusive by developing language models that support smaller, lesser-known languages like Sámi, rather than only the most widely spoken languages.

This is important because it can help ensure that AI technology is accessible and beneficial to all people, not just those who speak the most common languages. By training language models for minority languages, the researchers are working to make AI more representative and inclusive.

Technical Explanation

The paper describes the progress made in training large language models (LLMs) for the Sámi language, specifically Northern Sámi. LLMs are AI systems that can understand and generate human-like text, and they are typically trained on vast amounts of data to achieve this capability.

The researchers trained their Sámi language models using a variety of techniques (see also "SambaLingo: Teaching Large Language Models New Languages" and "Large Language Models for Expansion of Spoken Language Understanding"). They evaluated the performance of these models on a range of tasks (see also "How Good Are Large Language Models for African Languages?" and "LAMI: Large Language Models for Multi-Modal Human-AI Interaction").

The insights gained from this research can help inform the development of more inclusive AI systems that support minority languages like Sámi and other underrepresented languages (see also "Walia: Enhancing Amharic LLMs by Integrating with LLAMA").
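
The paper's abstract contrasts two multilingual setups, joint training and sequential training, and reports that decoder-only models did better under the sequential one. The sketch below is only an illustration of how the two setups differ in the order in which training examples are presented; it is not the authors' code. The function names, the toy corpora, and the choice of Finnish ("fi") as the language paired with Northern Sámi ("se") are all assumptions made for this example.

```python
import random
from typing import Dict, List, Tuple

# One training example: (language code, text). Hypothetical type for this sketch.
Example = Tuple[str, str]


def joint_schedule(corpora: Dict[str, List[str]], seed: int = 0) -> List[Example]:
    """Joint multilingual training: pool every language and shuffle them together."""
    pooled = [(lang, text) for lang, texts in corpora.items() for text in texts]
    random.Random(seed).shuffle(pooled)
    return pooled


def sequential_schedule(
    corpora: Dict[str, List[str]], order: List[str], seed: int = 0
) -> List[Example]:
    """Sequential multilingual training: one language at a time, in the given order."""
    rng = random.Random(seed)
    schedule: List[Example] = []
    for lang in order:
        stage = [(lang, text) for text in corpora[lang]]
        rng.shuffle(stage)  # shuffle within a stage, never across stages
        schedule.extend(stage)
    return schedule


# Toy placeholder corpora (assumed data, not from the paper).
corpora = {
    "fi": ["fi text 1", "fi text 2", "fi text 3"],
    "se": ["se text 1", "se text 2"],
}

joint = joint_schedule(corpora)
sequential = sequential_schedule(corpora, order=["fi", "se"])

print([lang for lang, _ in joint])
print([lang for lang, _ in sequential])
```

Both schedules contain exactly the same examples; only the ordering differs. In the sequential case the model would see the larger related-language corpus first and the Sámi text last, whereas in the joint case the two are interleaved throughout training.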

Critical Analysis

The paper acknowledges several caveats and limitations of the research, such as the challenges in obtaining sufficient training data for minority languages like Sámi. The authors also note that further research is needed to fully understand the performance and capabilities of the trained language models in real-world applications.

One potential issue that the paper does not address is the cultural and historical context of the Sámi language and its speakers. It would be important to consider how the development and deployment of these language models could impact the Sámi community, both positively and negatively.

Additionally, the paper could have discussed the ethical considerations around developing AI systems for minority languages, such as ensuring the models are not used in ways that could harm or exploit these communities.

Overall, the research presented in this paper is a valuable step towards making AI more inclusive, but there are still important questions and concerns that need to be addressed as this work progresses.

Conclusion

This paper contributes to the field of inclusive AI by exploring the development of large language models for Sámi, a group of minority languages spoken by the indigenous Sámi people of northern Norway, Sweden, Finland, and Russia's Kola Peninsula.

By focusing on Northern Sámi, the researchers are working to ensure that AI technology can support and benefit smaller, lesser-known languages, rather than just the most widely spoken ones. This is a crucial step towards making AI more representative and accessible to all people, regardless of their linguistic background.

The insights gained from this research can inform the development of similar language models for other minority languages, helping to create a more equitable and inclusive AI ecosystem. As the field of AI continues to evolve, studies like this one will be essential in ensuring that the benefits of these technologies are shared across diverse communities around the world.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language

Ronny Paul, Himanshu Buckchash, Shantipriya Parida, Dilip K. Prasad

Sámi, an indigenous language group comprising multiple languages, faces digital marginalization due to the limited availability of data and sophisticated language models designed for its linguistic intricacies. This work focuses on increasing technological participation for the Sámi language. We draw the attention of the ML community towards the language modeling problem of Ultra Low Resource (ULR) languages. ULR languages are those for which the amount of available textual resources is very low, and the speaker count for them is also very low. ULRLs are also not supported by mainstream Large Language Models (LLMs) like ChatGPT, due to which gathering artificial training data for them becomes even more challenging. Mainstream AI foundational model development has given less attention to this category of languages. Generally, these languages have very few speakers, making it hard to find them. However, it is important to develop foundational models for these ULR languages to promote inclusion and the tangible abilities and impact of LLMs. To this end, we have compiled the available Sámi language resources from the web to create a clean dataset for training language models. In order to study the behavior of modern LLM models with ULR languages (Sámi), we have experimented with different kinds of LLMs, mainly on the order of seven billion parameters. We have also explored the effect of multilingual LLM training for ULRLs. We found that the decoder-only models under a sequential multilingual training scenario perform better than joint multilingual training, whereas multilingual training with high semantic overlap, in general, performs better than training from scratch. This is the first study on the Sámi language for adapting non-statistical language models that use the latest developments in the field of natural language processing (NLP).

5/10/2024

SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

7/19/2024

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Jakub Hoscilowicz, Pawel Pawlowski, Marcin Skorupa, Marcin Sowański, Artur Janicki

Spoken Language Understanding (SLU) models are a core component of voice assistants (VA), such as Alexa, Bixby, and Google Assistant. In this paper, we introduce a pipeline designed to extend SLU systems to new languages, utilizing Large Language Models (LLMs) that we fine-tune for machine translation of slot-annotated SLU training data. Our approach improved on the MultiATIS++ benchmark, a primary multi-language SLU dataset, in the cloud scenario using an mBERT model. Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%, compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). In the on-device scenario (tiny and not pretrained SLU), our method improved the Overall Accuracy from 5.31% to 22.06% over the baseline Global-Local Contrastive Learning Framework (GL-CLeF) method. Contrary to both FC-MTLF and GL-CLeF, our LLM-based machine translation does not require changes in the production architecture of SLU. Additionally, our pipeline is slot-type independent: it does not require any slot definitions or examples.

4/4/2024

Socially Responsible Data for Large Multilingual Language Models

Andrew Smart, Ben Hutchinson, Lameck Mbangula Amugongo, Suzanne Dikker, Alex Zito, Amber Ebinama, Zara Wudiri, Ding Wang, Erin van Liemt, João Sedoc, Seyi Olojo, Stanley Uwakwe, Edem Wornyo, Sonja Schmer-Galunder, Jamila Smith-Loud

Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside of the Global North, which include many languages that have been historically underrepresented in digital realms. These languages have been coined as low resource languages or long-tail languages, and LLMs performance on these languages is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection on these languages is not extractive and that it does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized people, indigenous people, and non-Western languages raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship, and our own work, and outlines several relevant social, cultural, and ethical considerations and potential ways to mitigate them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data on underrepresented language communities outside of the Global North.

9/10/2024