LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback

2406.01771

Published 6/5/2024 by Wen Lai, Mohsen Mesgar, Alexander Fraser

LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback

Abstract

To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low-resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.

Create account to get full access

Overview

This paper explores ways to improve the multilingual capabilities of large language models (LLMs) beyond just English.
The key idea is to use "cross-lingual feedback" to help LLMs learn new languages more effectively.
The authors propose several techniques, including combining language models, using parallel text data, and leveraging multilingual pretraining.
The goal is to create LLMs that can understand and generate content in a wider range of languages, not just English.

Plain English Explanation

Large language models (LLMs) like GPT-3 have shown impressive abilities in tasks like text generation and translation. However, most of these models are primarily focused on English, leaving their abilities in other languages somewhat limited.

This paper explores ways to improve the multilingual capabilities of LLMs. The key idea is to use "cross-lingual feedback" - that is, using information from one language to help the model learn another.

For example, the model might be shown a sentence in English and its translation in Spanish. By learning the relationship between the two, it can build a better understanding of both languages. Other research has also explored ways to share vocabulary and knowledge across languages to help LLMs become more multilingual.

The authors propose several techniques, like combining different language models, using parallel text data, and leveraging multilingual pretraining. The goal is to create LLMs that can understand and generate content in a wider range of languages, not just English. This could have big implications for making AI systems more globally accessible and inclusive.

Technical Explanation

The paper explores methods for scaling the multilingual capabilities of large language models (LLMs) beyond just English. The key idea is to leverage "cross-lingual feedback" - using information from one language to help the model learn another.

The authors propose several techniques:

Model Combination: Combining multiple monolingual LLMs, each trained on a different language, to create a single multilingual model. This allows the model to draw on the knowledge of each individual language.
Parallel Text Data: Using parallel text datasets, where the same content is available in multiple languages. This provides the model with direct translation examples to learn from.
Multilingual Pretraining: Pretraining the model on a large, multilingual corpus before fine-tuning on specific tasks. This allows the model to develop a stronger foundation in multiple languages early on.

The experiments demonstrate that these techniques can significantly improve the multilingual performance of LLMs, with substantial gains in tasks like machine translation and cross-lingual language understanding. The authors also discuss the potential limitations and future research directions in this area.

Critical Analysis

The paper presents an important step forward in improving the multilingual capabilities of large language models. The proposed techniques, such as model combination and leveraging parallel text data, seem well-motivated and show promising empirical results.

However, the authors acknowledge several limitations and areas for further research. For example, the performance gains may be uneven across different language pairs, and the model may still struggle with low-resource languages. Additional research is needed to address these challenges and further scale multilingual LLMs.

Additionally, the paper does not delve deeply into potential societal implications or ethical concerns. As these models become more multilingual and powerful, there will be important questions to consider around bias, fairness, and the equitable distribution of their benefits across different linguistic communities. Further work is needed to address these issues proactively.

Overall, this paper represents a valuable contribution to the field, but there is still significant room for improvement and further research to truly unlock the full potential of multilingual language models.

Conclusion

This paper explores innovative techniques to scale the multilingual capabilities of large language models beyond just English. By leveraging cross-lingual feedback, model combination, parallel text data, and multilingual pretraining, the authors demonstrate substantial performance gains in tasks like machine translation and cross-lingual understanding.

These findings have important implications for making AI systems more globally accessible and inclusive. As language models become more multilingual, they can break down linguistic barriers and enable more people around the world to benefit from the transformative power of language technology.

However, the paper also highlights the need for continued research to address the limitations and ethical considerations of these models. Ensuring that the benefits of multilingual LLMs are distributed equitably, and that these systems are developed responsibly, will be crucial as the field continues to advance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

Lynn Chua, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chulin Xie, Chiyuan Zhang

Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz) contexts. We observe that simple inference-time mitigation methods offer only limited improvement. On the other hand, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is publicly available at https://github.com/google-research/crosslingual-knowledge-barriers.

6/26/2024

cs.CL cs.LG

A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias

Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Yuqi Ye, Hanwen Gu

Based on the foundation of Large Language Models (LLMs), Multilingual Large Language Models (MLLMs) have been developed to address the challenges of multilingual natural language processing tasks, hoping to achieve knowledge transfer from high-resource to low-resource languages. However, significant limitations and challenges still exist, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we aim to provide a comprehensive analysis of MLLMs, delving deeply into discussions surrounding these critical issues. First of all, we start by presenting an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Secondly, we explore widely utilized multilingual corpora for MLLMs' training and multilingual datasets oriented for downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Thirdly, we survey the existing studies on multilingual representations and investigate whether the current MLLMs can learn a universal language representation. Fourthly, we discuss bias on MLLMs including its category and evaluation metrics, and summarize the existing debiasing techniques. Finally, we discuss existing challenges and point out promising research directions. By demonstrating these aspects, this paper aims to facilitate a deeper understanding of MLLMs and their potentiality in various domains.

6/7/2024

cs.CL cs.AI

SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki, Bo Li, Jonathan Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, Urmish Thakker

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

4/10/2024

cs.CL cs.AI cs.LG

🔮

How Vocabulary Sharing Facilitates Multilingualism in LLaMA?

Fei Yuan, Shuai Yuan, Zhiyong Wu, Lei Li

Large Language Models (LLMs), often show strong performance on English tasks, while exhibiting limitations on other languages. What is an LLM's multilingual capability when it is trained only on certain languages? The underlying mechanism remains unclear. This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective by conducting an exhaustive analysis across 101 languages. Through the investigation of the performance gap before and after embedding fine-tuning, we discovered four distinct quadrants. By delving into each quadrant we provide actionable and efficient guidelines for tuning these languages. Extensive experiments reveal that existing LLMs possess multilingual capabilities that surpass our expectations, and we can significantly improve the multilingual performance of LLMs based on these attributes of each quadrant~footnote{url{https://github.com/CONE-MT/Vocabulary-Sharing-Facilitates-Multilingualism}.}.

6/4/2024

cs.CL cs.AI