Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

2405.14277

Published 5/24/2024 by Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly

Abstract

Training LLMs for low-resource languages usually relies on data augmentation with machine translation (MT) from English. However, translation brings a number of challenges: translating and curating huge amounts of content with high-end machine translation solutions is costly, the translated content carries over cultural biases, and if the translation is not faithful and accurate, data quality degrades, causing issues in the trained model. In this work we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the free NLLB-3B MT model. We train a number of story generation models of sizes 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of high-quality stories synthesized in Arabic by a capable LLM, representing 1% of the original training data. We show, using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the translation pitfalls. We illustrate the improvement through case studies of linguistic issues and cultural bias.

Overview

Explores the use of machine translation (MT) for data augmentation in training language models for low-resource languages
Highlights challenges with MT, such as high costs, cultural biases, and degradation of data quality
Investigates the role of translation and synthetic data in training story generation models

Plain English Explanation

Training language models for low-resource languages often relies on using machine translation (MT) to translate data from a high-resource language like English. However, this approach comes with several challenges. Translating large amounts of content with high-quality MT solutions can be very expensive. Additionally, the translated content may carry over cultural biases from the source language, and if the translation is not accurate, the quality of the training data can degrade, leading to issues in the trained model.

In this paper, the researchers investigate these challenges by training story generation models on a dataset of 2.2 million short stories for young children, translated from English to Arabic using a free MT model. They find a number of quality and task-specific issues in the resulting models. To address these problems, the researchers further pre-train the models with a small dataset of high-quality stories synthesized in Arabic by a capable large language model (LLM).

The researchers show that this approach, which combines MT-based data augmentation with continual pre-training on a small amount of synthetic data, can help resolve some of the translation pitfalls. They demonstrate the improvements through case studies of linguistic issues and cultural bias in the generated stories.

Technical Explanation

The researchers investigate the role of translation and synthetic data in training language models, specifically for story generation. They use the TinyStories dataset, which contains 2.2 million short stories for 3-4 year old children, and translate it from English to Arabic using the free NLLB-3B MT model.

The researchers train a number of story generation models with varying sizes (1M-33M parameters) using the translated data. They identify various quality and task-specific issues in these models, such as linguistic problems and cultural biases. To address these issues, the researchers further pre-train the models with a small dataset of synthesized high-quality stories in Arabic, representing 1% of the original training data. This synthetic data is generated using a capable LLM.
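The data proportions described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the 1% figure is counted in stories rather than tokens (the summary does not say which); the function names and stage labels are hypothetical and are not taken from the paper.

```python
# Sizes taken from the summary above; everything else is illustrative.
ORIGINAL_STORIES = 2_200_000   # translated TinyStories stories
SYNTHETIC_FRACTION = 0.01      # synthetic set is about 1% of the original data


def synthetic_story_count(original: int, fraction: float) -> int:
    """Number of synthesized stories implied by the stated fraction."""
    return round(original * fraction)


def two_stage_schedule(original: int, fraction: float) -> list[tuple[str, int]]:
    """Hypothetical two-stage plan: translated data first, then synthetic data."""
    return [
        ("pretrain_on_translated", original),
        ("continue_on_synthetic", synthetic_story_count(original, fraction)),
    ]


schedule = two_stage_schedule(ORIGINAL_STORIES, SYNTHETIC_FRACTION)
for stage, n_stories in schedule:
    print(f"{stage}: {n_stories:,} stories")
```

Under this reading, the second stage would see roughly 22,000 stories, which gives a sense of how small the corrective dataset is relative to the translated corpus.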

The researchers use GPT-4 as a judge to evaluate the quality of the generated stories, and they apply dictionary learning analysis, a technique from mechanistic interpretability, to understand the improvements. They show that combining MT-based data augmentation with continual pre-training on synthetic data is a practical way to resolve some of the translation pitfalls.
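To give a rough idea of what dictionary learning involves, here is a minimal sketch of one common formulation in mechanistic interpretability: a sparse autoencoder that rewrites a model activation as a sparse combination of learned feature directions. Whether the paper uses exactly this formulation is an assumption; the weights, dimensions, and the `l1_coeff` value are toy values chosen for illustration, and no training loop is shown.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_forward(x, w_enc, b_enc, w_dec):
    """Encode an activation vector into sparse features, then reconstruct it."""
    pre_activations = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre_activations)
    reconstruction = matvec(w_dec, features)
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Toy setup: a 2-dimensional activation and a dictionary of 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 3 features x 2 dims
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # 2 dims x 3 features

x = [2.0, 0.0]
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.1)
print(features, reconstruction, loss)
```

In this toy case only the first feature fires, and the activation is reconstructed exactly, so the loss comes entirely from the sparsity penalty. In an analysis like the one described, the learned features would be inspected to see which ones activate on problematic or culturally mismatched text before and after the extra pre-training.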

Critical Analysis

The researchers acknowledge the limitations of their approach, such as the potential for the synthetic data to introduce new biases or issues. Additionally, the paper does not provide a comprehensive comparison of their method to other techniques for addressing the challenges of training language models in low-resource languages, such as using machine translation to augment multilingual classification, adapting open-source generative large language models, or eliciting translation ability in large language models.

Further research could explore the scalability of the proposed approach, the long-term effects of fine-tuning with synthetic data, and the potential to fine-tune large language models to translate directly from the low-resource language, rather than relying on translation from English.

Conclusion

This paper investigates the use of machine translation and synthetic data for training language models in low-resource languages. The researchers demonstrate that combining MT-based data augmentation with continual pre-training on a small dataset of high-quality, synthesized stories can help address some of the challenges associated with translation, such as cultural biases and degradation of data quality. The insights from this work can inform future research and development of language models for underrepresented languages, with the goal of improving accessibility and inclusivity in the field of natural language processing.

Related Papers

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

4/16/2024

cs.CL

Using Machine Translation to Augment Multilingual Classification

Adam King

An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.

5/10/2024

cs.CL

LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language

Cagri Toraman

Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.

5/14/2024

cs.CL cs.AI

Continual Learning Under Language Shift

Evangelia Gogoulou, Timothée Lesort, Magnus Boman, Joakim Nivre

The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data comes from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and characteristics of languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.

6/28/2024

cs.CL cs.LG