Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Read original: arXiv:2407.03145 - Published 7/4/2024 by Minato Kondo, Takehito Utsuro, Masaaki Nagata

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Overview

This paper explores a novel approach to enhancing the translation capabilities of large language models (LLMs) through continual pre-training on parallel data.
The researchers investigate how continuously pre-training LLMs on aligned text pairs from diverse language domains can improve their performance on translation tasks.
The proposed method aims to address the limitations of LLMs trained on only translated data by further fine-tuning them on parallel corpora, which can capture more nuanced cross-lingual relationships.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. However, their translation capabilities can be limited, especially when they are trained solely on translated data. This paper presents a new technique to improve the translation performance of LLMs by continuously pre-training them on parallel data - text that has been professionally translated between languages.

The key idea is that by continuously exposing the LLM to high-quality parallel data, the model can learn more nuanced cross-lingual relationships and patterns. This helps the model better understand the subtle differences between languages and produce more accurate translations.

The researchers tested their approach on several popular LLMs, including models discussed in related papers, models that use continual pre-training, and models trained on translated data. They found that their continual pre-training technique significantly improved the translation accuracy of these LLMs, outperforming previous approaches.

The benefits of this method are twofold: it can enhance the translation capabilities of existing LLMs, and it provides a new pathway for developing more powerful cross-lingual AI systems in the future. By continually refining LLMs on high-quality parallel data, researchers can investigate the translation capabilities of large language models and explore efficient ways to adapt them for cross-lingual tasks.

Technical Explanation

The researchers propose a continual pre-training approach to enhance the translation capabilities of large language models (LLMs). The key steps of their method are:

Pre-training on Monolingual Data: The researchers start with an LLM that has been pre-trained on large monolingual text corpora, such as BERT or RoBERTa.
Continual Pre-training on Parallel Data: They then continuously pre-train the LLM on high-quality parallel data - text that has been professionally translated between languages. This helps the model learn more nuanced cross-lingual relationships and patterns.
Fine-tuning on Translation Tasks: Finally, the researchers fine-tune the continually pre-trained LLM on specific translation tasks, such as translating between English and other languages. This allows the model to further specialize its translation capabilities for the target domains.

The researchers evaluated their approach on several popular LLMs and found that the continual pre-training on parallel data significantly improved the translation accuracy of these models compared to previous methods, such as training on translated data alone or investigating translation capabilities without continual pre-training.

Critical Analysis

The researchers acknowledge several limitations and areas for further research in their paper:

The effectiveness of the continual pre-training approach may depend on the quality and diversity of the parallel data used. Further investigation is needed to understand the optimal data requirements for different language pairs and domains.
While the proposed method can enhance the translation capabilities of existing LLMs, it may not be as efficient as developing new models specifically designed for cross-lingual adaptation. The tradeoffs between performance and computational cost should be explored.
The paper focuses on improving translation accuracy, but other aspects of translation quality, such as fluency and naturalness, could also be investigated in future studies.

Overall, the researchers present a compelling approach to improving the translation capabilities of large language models. By leveraging high-quality parallel data through continual pre-training, they demonstrate a promising path for advancing cross-lingual AI systems.

Conclusion

This paper introduces a novel technique for enhancing the translation accuracy of large language models (LLMs) through continual pre-training on parallel data. The key insight is that continuously exposing LLMs to professionally translated text can help them learn more nuanced cross-lingual relationships, leading to improved translation performance.

The researchers show that their approach outperforms previous methods, such as training on translated data alone or investigating translation capabilities without continual pre-training. This work offers a new direction for developing more powerful and versatile cross-lingual AI systems, with potential applications in machine translation, multilingual language understanding, and beyond.

While the paper highlights some limitations and areas for further research, the presented continual pre-training technique represents an important step forward in advancing the translation capabilities of large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Minato Kondo, Takehito Utsuro, Masaaki Nagata

In this paper, we propose a two-phase training approach where pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of our proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data across eight different formats. We evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Additionally, we demonstrated that the translation accuracy improves only for translation directions where the order of source and target sentences aligns between continual pre-training data and inference. In addition, we demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data compared to supervised encoder-decoder models. We also show that the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.

7/4/2024

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, Xiaoyu Chen

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experimental results conducted using the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.

4/16/2024

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

4/30/2024

💬

Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly

Training LLMs for low-resource languages usually utilizes data augmentation from English using machine translation (MT). This, however, brings a number of challenges to LLM training: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality Arabic stories generated by a capable LLM, representing 1% of the original training data. We show, using GPT-4 as a judge and Dictionary Learning Analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the machine translation pitfalls. We illustrate the improvements through case studies of linguistic and cultural bias issues.

8/9/2024