Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars

Read original: arXiv:2404.07673 - Published 4/12/2024 by Andrés Lou, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

Overview

This paper focuses on developing curated datasets and neural models for machine translation between Mayan and Spanish vernacular languages.
The researchers aim to improve the translation of informal, colloquial language for these language pairs, an area that remains understudied.
The paper provides an overview of Mayan languages, reviews related work, presents the dataset curation process and neural model architectures, and discusses the results and limitations of the research.

Plain English Explanation

The researchers in this paper wanted to improve machine translation between Mayan and Spanish languages, particularly for informal, everyday language. Mayan languages are a group of indigenous languages spoken in parts of Mexico, Guatemala, Belize, and Honduras. While there has been a lot of work on machine translation for formal, written language, translating casual, conversational speech between Mayan and Spanish is much more challenging.

To tackle this problem, the researchers first curated parallel collections of informal, everyday text in several Mayan languages and Spanish, drawn from official native sources. They then used these datasets, together with other available resources, to train neural machine translation models and evaluated them on the new data.

The paper provides background on the Mayan language family and reviews other relevant research in this area. It then describes the dataset curation process and the neural network architectures the team developed. Finally, the researchers discuss the results of their experiments and acknowledge some of the remaining limitations and areas for future work.

The key contribution of this paper is the creation of high-quality training datasets and specialized machine translation models to improve communication between Mayan and Spanish-speaking communities, particularly for casual, conversational language use. This can have important real-world applications in areas like customer service, healthcare, and education.

Technical Explanation

The paper first provides an overview of the Mayan language family, which consists of over 30 distinct languages spoken by millions of people in Mesoamerica. It then reviews related work on machine translation for low-resource language pairs, including efforts to create datasets and models for Sanskrit and English and for Brazilian Portuguese, among other languages.

To build their curated datasets, which they call MayanV, the researchers collected text in several Mayan languages spoken in Guatemala and Southern Mexico from official native sources that focus on informal, day-to-day, non-domain-specific language. They used a variety of techniques to clean, filter, and align the data with Spanish, creating parallel corpora for training machine translation models.
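The cleaning and filtering steps mentioned above might look roughly like the following. This is a minimal sketch, not the authors' pipeline: the function name `clean_parallel_pairs`, the `max_len_ratio` threshold, and the specific filters (whitespace and Unicode normalization, empty-side removal, length-ratio filtering, exact de-duplication) are assumptions chosen for illustration, since the summary does not say which techniques were used.

```python
import unicodedata


def clean_parallel_pairs(pairs, max_len_ratio=3.0):
    """Normalize, length-filter, and de-duplicate (source, target) sentence pairs.

    Hypothetical helper for illustration only; the threshold is an assumption.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Normalize Unicode to NFC and collapse runs of whitespace.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())

        # Drop pairs where either side is empty after normalization.
        if not src or not tgt:
            continue

        # Drop pairs whose token-count ratio is implausibly large,
        # which often signals a misaligned pair.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue

        # Drop exact duplicates, compared case-insensitively.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

Calling `clean_parallel_pairs` on a list of `(mayan_sentence, spanish_sentence)` tuples returns the surviving pairs in their original order. A real pipeline would likely add language identification and a proper sentence aligner, which are beyond this sketch.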

The paper then describes the neural network architectures the team employed, which build on recent advances in cross-lingual language models and retrieval-augmented translation. According to the abstract, the models are trained on as many resources and Mayan languages as possible, with the goal of translating colloquial, context-dependent language between the Mayan and Spanish vernaculars.

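The models are evaluated exclusively on the authors' own MayanV datasets. According to the abstract, adding resources other than the ones the authors present does not seem to improve translation performance, which suggests that many existing resources do not accurately capture common, real-life language usage. The authors also observe lexical divergences between the Spanish dialects in their data and the more widespread written standard. The researchers acknowledge that further work is needed to handle regional dialects, code-switching, and other complexities of Mayan-Spanish communication.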
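To compare translation systems on a held-out test set, as in the evaluation described above, a character n-gram F-score is one common choice. The summary does not say which metric the authors used, so the following is only an illustrative sketch of a simplified chrF-style score. The function names, the `max_n` and `beta` defaults, and the decision to ignore spaces are all assumptions, and the result will not match the output of a reference chrF implementation.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified chrF-style score in [0, 1] (illustrative, not the official metric)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        # Skip orders for which one side is too short to have any n-grams.
        if not hyp or not ref:
            continue
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Averaging `simple_chrf` over all test sentences for each system gives a rough basis for comparison. For reporting results, an established implementation of a standard metric would be the safer choice.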

Critical Analysis

The researchers have made a valuable contribution by focusing on the understudied problem of translating informal, conversational language between Mayan and Spanish. The creation of high-quality datasets and specialized neural models for this task is an important step forward.

That said, the paper could have provided more details on the specific challenges involved in working with Mayan languages, which have diverse dialects, complex grammatical structures, and limited digital resources. The researchers also did not deeply explore potential biases or fairness issues that could arise when deploying these translation systems in real-world settings.

Additionally, while the experimental results are promising, the researchers note that further work is needed to handle phenomena like code-switching, regional slang, and other complexities of Mayan-Spanish communication. Expanding the language coverage, improving robustness, and conducting more thorough evaluations would strengthen the overall research.

Overall, this paper makes a solid contribution to the field of low-resource machine translation, but there are opportunities to build on this work and address some of the remaining challenges and limitations.

Conclusion

This paper presents an important effort to develop curated datasets and neural models for improving machine translation between Mayan and Spanish vernacular languages. By focusing on informal, colloquial language, the researchers are addressing a critical gap in existing translation technologies.

The creation of high-quality parallel corpora and specialized neural architectures for this task is a valuable contribution that can have real-world impact in areas like healthcare, education, and customer service where effective communication between Mayan and Spanish speakers is essential.

While the results are promising, the researchers acknowledge that further work is needed to handle the complexities of Mayan-Spanish language use, including regional dialects, code-switching, and other nuances. Addressing these challenges and expanding the scope of the research could lead to even more impactful advancements in low-resource machine translation.

Related Papers

Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars

Andrés Lou, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value, that, nevertheless, remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish, and that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at https://github.com/transducens/mayanv.

4/12/2024

The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

María Grandury

We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean and Spain needs to be represented in Artificial Intelligence (AI) systems. Despite being the 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how we have created as an international open-source community the first versions of the instruction and evaluation datasets, indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.

7/26/2024

Tagengo: A Multilingual Chat Dataset

Peter Devine

Open source large language models (LLMs) have shown great improvements in recent times. However, many of these models are focused solely on popular spoken languages. We present a high quality dataset of more than 70k prompt-response pairs in 74 languages which consist of human generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that our multilingual model outperforms previous state-of-the-art open source LLMs across each language. We further find that training on more multilingual data is beneficial to the performance in a chosen target language (Japanese) compared to simply training on only data in that language. These results indicate the necessity of training on large amounts of high quality multilingual data to make a more accessible LLM.

5/22/2024

Open Generative Large Language Models for Galician

Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia

Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.

6/21/2024