Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

Read original: arXiv:2405.14277 - Published 8/9/2024 by Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly


Overview

Explores the use of machine translation (MT) for data augmentation in training language models for low-resource languages
Highlights challenges with MT, such as high costs, cultural biases, and degradation of data quality
Investigates the role of translation and synthetic data in training story generation models

Plain English Explanation

Training language models for low-resource languages often relies on machine translation (MT) of data from a high-resource language such as English. However, this approach comes with several challenges. Translating large amounts of content with high-quality MT solutions can be very expensive. The translated content may also carry over cultural biases from the source language, and if the translation is inaccurate, the quality of the training data degrades, which causes problems in the trained model.

In this paper, the researchers investigate these challenges by training story generation models on a dataset of 2.2 million short stories for young children, translated from English to Arabic with a free MT model. They find a number of quality and task-specific issues in the resulting models. To address these problems, they further pre-train the models on a small dataset of high-quality Arabic stories synthesized by a capable large language model (LLM).

The researchers show that this approach, which combines MT-based data augmentation with continual pre-training on synthetic data, can help resolve some of the translation pitfalls. They demonstrate the improvements through case studies of linguistic issues and cultural bias in the generated stories.

Technical Explanation

The researchers investigate the role of translation and synthetic data in training language models, specifically for story generation. They use the TinyStories dataset, which contains 2.2 million short stories for 3- to 4-year-old children, and translate it from English to Arabic using the open NLLB-3B MT model.

The researchers train a number of story generation models with varying sizes (1M-33M parameters) using the translated data. They identify various quality and task-specific issues in these models, such as linguistic problems and cultural biases. To address these issues, the researchers further pre-train the models with a small dataset of synthesized high-quality stories in Arabic, representing 1% of the original training data. This synthetic data is generated using a capable LLM.
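To make the "1%" figure concrete, here is a minimal sketch of the data budget and the two-stage ordering described above. It assumes the 1% is counted in stories (the summary does not say whether the unit is stories or tokens), and the function names and stage labels are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Figures taken from the summary; the unit of the 1% (stories) is an assumption.
TRANSLATED_STORIES = 2_200_000
SYNTHETIC_PERCENT = 1


def synthetic_budget(original_count: int, percent: int) -> int:
    """Number of synthetic stories for a given percentage of the original corpus."""
    return original_count * percent // 100


def two_stage_schedule(
    translated: Iterable[str], synthetic: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (stage, story) pairs: translated data first, synthetic data second."""
    for story in translated:
        yield ("pretrain_translated", story)
    for story in synthetic:
        yield ("continual_synthetic", story)


budget = synthetic_budget(TRANSLATED_STORIES, SYNTHETIC_PERCENT)
schedule = list(
    two_stage_schedule(["mt_story_1", "mt_story_2"], ["synthetic_story_1"])
)
```

Under that assumption the synthetic set comes to 22,000 stories, a small fraction of the translation effort, which is what makes the second stage cheap relative to the first.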

The researchers use GPT-4 as a judge to evaluate the quality of the generated stories, and they apply dictionary learning analysis, a technique from mechanistic interpretability, to understand the improvements. They show that combining MT-based data augmentation with continual pre-training on synthetic data is a practical way to resolve some of the translation pitfalls.
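Dictionary learning in mechanistic interpretability is commonly implemented with a sparse autoencoder that re-expresses a model's internal activations as a small number of active features. The sketch below shows only the forward pass and the usual objective (reconstruction error plus an L1 sparsity penalty) in plain Python. The summary does not describe the paper's exact architecture or hyperparameters, so the class, the sizes, the random initialization, and the `l1_coeff` value are all illustrative assumptions, and training is omitted.

```python
import random
from typing import List


def relu(value: float) -> float:
    return value if value > 0.0 else 0.0


class SparseAutoencoder:
    """Toy sparse autoencoder: activation -> sparse features -> reconstruction."""

    def __init__(self, d_model: int, d_dict: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Randomly initialized weights; a real analysis would train these.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x: List[float]) -> List[float]:
        """Non-negative feature activations (one per dictionary entry)."""
        return [
            relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w_enc, self.b_enc)
        ]

    def decode(self, features: List[float]) -> List[float]:
        """Reconstruct the activation as a weighted sum of dictionary directions."""
        return [
            sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(self.w_dec, self.b_dec)
        ]

    def loss(self, x: List[float], l1_coeff: float = 1e-3) -> float:
        """Mean squared reconstruction error plus an L1 penalty on the features."""
        features = self.encode(x)
        reconstruction = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
        sparsity = sum(features)  # equals the L1 norm because features are >= 0
        return mse + l1_coeff * sparsity


def top_features(features: List[float], k: int = 3) -> List[int]:
    """Indices of the k most active features, ignoring inactive ones."""
    ranked = sorted(range(len(features)), key=lambda i: features[i], reverse=True)
    return [i for i in ranked[:k] if features[i] > 0.0]


sae = SparseAutoencoder(d_model=4, d_dict=16)
activation = [0.5, -1.2, 0.3, 0.8]  # stand-in for one hidden-state vector
features = sae.encode(activation)
```

In an analysis of this kind, one would typically compare which features are most active on the same inputs before and after the additional pre-training, using something like `top_features(features)`, and inspect the examples that activate them.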

Critical Analysis

The researchers acknowledge the limitations of their approach, such as the potential for the synthetic data to introduce new biases or issues. In addition, the paper does not comprehensively compare its method with other techniques for training language models in low-resource languages, such as using machine translation to augment multilingual classification, adapting open-source generative large language models, or eliciting translation ability in large language models.

Further research could explore the scalability of the proposed approach, the long-term effects of continual pre-training on synthetic data, and the potential to fine-tune large language models to translate directly from the low-resource language, rather than relying on translation from English.

Conclusion

This paper investigates the use of machine translation and synthetic data for training language models in low-resource languages. The researchers demonstrate that combining MT-based data augmentation with continual pre-training on a small dataset of high-quality, synthesized stories can help address some of the challenges associated with translation, such as cultural biases and degraded data quality. The insights from this work can inform future research and development of language models for underrepresented languages, with the goal of improving accessibility and inclusivity in natural language processing.



Related Papers


Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly

Training LLMs for low-resource languages usually utilizes data augmentation from English using machine translation (MT). This, however, brings a number of challenges to LLM training: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality Arabic stories generated by a capable LLM, representing 1% of the original training data. We show, using GPT-4 as a judge and Dictionary Learning Analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the machine translation pitfalls. We illustrate the improvements through case studies of linguistic and cultural bias issues.

8/9/2024

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

4/16/2024

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Minato Kondo, Takehito Utsuro, Masaaki Nagata

In this paper, we propose a two-phase training approach where pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of our proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data across eight different formats. We evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Additionally, we demonstrated that the translation accuracy improves only for translation directions where the order of source and target sentences aligns between continual pre-training data and inference. In addition, we demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data compared to supervised encoder-decoder models. We also show that the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.

7/4/2024


Using Machine Translation to Augment Multilingual Classification

Adam King

An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.

5/10/2024