The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights

2405.01345

Published 5/3/2024 by Wenhao Zhu, Shujian Huang, Fei Yuan, Cheng Chen, Jiajun Chen, Alexandra Birch

Abstract

Bridging the significant gap between large language models' English and non-English performance presents a great challenge. While some previous studies attempt to mitigate this gap with translated training data, the recently proposed question alignment approach leverages the model's English expertise to improve multilingual performance with minimum usage of expensive, error-prone translation. In this paper, we explore how broadly this method can be applied by examining its effects in reasoning with executable code and reasoning with common sense. We also explore how to apply this approach efficiently to extremely large language models using proxy-tuning. Experiment results on multilingual reasoning benchmarks mGSM, mSVAMP and xCSQA demonstrate that the question alignment approach can be used to boost multilingual performance across diverse reasoning scenarios, model families, and sizes. For instance, when applied to the LLaMA2 models, our method brings an average accuracy improvement of 12.2% on mGSM even with the 70B model. To understand the mechanism of its success, we analyze representation space, chain-of-thought and translation data scales, which reveals how question translation training strengthens language alignment within LLMs and shapes their working patterns.

Overview

This paper explores the use of question translation training (also called question alignment) to improve the multilingual reasoning capabilities of large language models.
The researchers extend the method beyond its original setting, testing it on reasoning with executable code and on common-sense reasoning, and applying it to very large models through proxy-tuning.
Experiments on the mGSM, mSVAMP and xCSQA benchmarks, together with analyses of representation space, chain-of-thought and translation data scale, show that the approach improves multilingual performance across reasoning scenarios, model families, and sizes.

Plain English Explanation

The paper focuses on a technique called "question translation training" to help language models better understand and reason about questions in multiple languages. This builds on previous work on cross-lingual transfer, where models were trained to answer questions in one language using training data from another language.

The key idea here is that by training the model to translate questions from other languages into English, it learns to map non-English questions onto the English it already handles well. This language alignment lets the model apply its English reasoning skills to questions in other languages, with minimal reliance on expensive, error-prone translated training data.

The researchers conduct experiments to show how far this approach extends: it helps in reasoning with executable code and in common-sense reasoning, not only in the setting where it was first proposed. For example, the abstract reports that applying the method to LLaMA2 models improves average accuracy on mGSM by 12.2%, even with the 70B model.

The paper also explores how to apply the approach efficiently to extremely large language models using proxy-tuning. Overall, the findings suggest that question translation is a powerful technique for improving cross-lingual understanding and reasoning.

Technical Explanation

The researchers study "question translation training" (also called question alignment), a recently proposed training approach for improving the multilingual reasoning capabilities of large language models. It is an alternative to earlier attempts to close the gap between English and non-English performance by training on translated data.

The key step is training the model to translate questions from other languages into English. This language alignment lets the model bring its English expertise to bear on non-English questions. The researchers examine how broadly the method applies, extending it to reasoning with executable code and reasoning with common sense.
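The data preparation behind this kind of training can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt template, the `Example` class, the function names, and the stage labels are all hypothetical. The only parts taken from the source are that the model is fine-tuned on X-English parallel questions and that English instruction data is then used for the reasoning task itself.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One supervised fine-tuning example (hypothetical container)."""
    prompt: str
    target: str


# Hypothetical prompt template; the paper's actual wording may differ.
TRANSLATION_TEMPLATE = "Translate the following {lang} question into English.\n{question}\n"


def question_translation_examples(pairs):
    """Build X-to-English translation examples.

    pairs: iterable of (language_name, question_in_X, question_in_English).
    """
    return [
        Example(TRANSLATION_TEMPLATE.format(lang=lang, question=src), tgt)
        for lang, src, tgt in pairs
    ]


def english_instruction_examples(items):
    """Build English-only reasoning examples.

    items: iterable of (english_question, english_step_by_step_answer).
    """
    return [Example(question + "\n", answer) for question, answer in items]


def two_stage_curriculum(pairs, items):
    """Return the two training stages in order (stage labels are assumptions)."""
    return [
        ("stage_1_question_translation", question_translation_examples(pairs)),
        ("stage_2_english_instruction", english_instruction_examples(items)),
    ]


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any benchmark.
    pairs = [("German", "Wie viel ist 2 plus 3?", "What is 2 plus 3?")]
    items = [("What is 2 plus 3?", "2 plus 3 equals 5. The answer is 5.")]
    for name, examples in two_stage_curriculum(pairs, items):
        print(name, len(examples))
```

Note that only questions are translated in this sketch; the reasoning chains stay in English, which is what keeps the amount of translated data small.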

To test this, the researchers run experiments on the multilingual reasoning benchmarks mGSM, mSVAMP and xCSQA. The results show that question alignment boosts multilingual performance across reasoning scenarios, model families, and sizes. For the LLaMA2 models, the abstract reports an average accuracy improvement of 12.2% on mGSM, even with the 70B model.
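A figure such as the 12.2% average accuracy improvement quoted in the abstract is typically obtained by averaging per-language accuracy and comparing two systems. The sketch below shows that arithmetic under the assumption that the gain is an absolute difference in average accuracy; the function names are hypothetical and the numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
def mean_accuracy(per_language):
    """Average accuracy (in percent) over all languages, weighting each equally."""
    if not per_language:
        raise ValueError("need at least one language")
    return sum(per_language.values()) / len(per_language)


def average_gain(baseline, aligned):
    """Difference in average accuracy between two systems, in percentage points."""
    if baseline.keys() != aligned.keys():
        raise ValueError("both systems must be scored on the same languages")
    return mean_accuracy(aligned) - mean_accuracy(baseline)


if __name__ == "__main__":
    # Placeholder accuracies for illustration only; NOT taken from the paper.
    baseline = {"en": 60.0, "de": 50.0, "sw": 10.0}
    aligned = {"en": 60.0, "de": 56.0, "sw": 34.0}
    print(average_gain(baseline, aligned))
```

Because each language counts equally in this average, large gains on a few low-resource languages can dominate the headline number, so per-language results are worth reading alongside it.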

The paper also explores how to apply the approach efficiently to extremely large language models using proxy-tuning. To understand why the method works, the researchers analyze the representation space, the chain-of-thought, and the scale of the translation data, which reveals how question translation training strengthens language alignment within the model and shapes its working patterns.
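The abstract mentions proxy-tuning as the way to reach extremely large models without fine-tuning them directly. As generally described, proxy-tuning is a decoding-time technique: the next-token logits of a large untuned model are shifted by the difference between a small tuned model and its small untuned counterpart. The sketch below shows only that logit arithmetic in plain Python. How the paper configures it (which models play which role, any scaling factor) is not stated in this summary, so the function names and the example values are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(large_base, small_tuned, small_base):
    """Next-token distribution from proxy-tuning style logit arithmetic.

    All three arguments are logits over the same vocabulary:
      large_base  - large model that was NOT fine-tuned
      small_tuned - small model after fine-tuning
      small_base  - the same small model before fine-tuning
    """
    if not (len(large_base) == len(small_tuned) == len(small_base)):
        raise ValueError("all logit vectors must share one vocabulary")
    shifted = [
        big + (tuned - untuned)
        for big, tuned, untuned in zip(large_base, small_tuned, small_base)
    ]
    return softmax(shifted)


if __name__ == "__main__":
    # Toy three-token vocabulary; the values are illustrative only.
    probs = proxy_tuned_distribution(
        large_base=[1.0, 1.0, 1.0],
        small_tuned=[0.0, 2.0, 0.0],
        small_base=[0.0, 0.0, 0.0],
    )
    print(probs)
```

When the small tuned and untuned models agree, the offset is zero and the large model's distribution is unchanged; fine-tuning the small model is what steers the large one.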

Overall, the findings suggest that question translation training is a powerful technique for improving multilingual pretraining and instruction tuning to enhance cross-lingual understanding and reasoning capabilities in large language models.

Critical Analysis

The paper presents a well-designed and thorough investigation into the benefits of question translation training for multilingual reasoning. The experimental results are compelling and the analysis provides valuable insights into the underlying mechanisms at play.

One potential limitation is the reliance on a small set of benchmark datasets (mGSM, mSVAMP and xCSQA), which may not fully capture the breadth of real-world multilingual language understanding challenges. It would be interesting to see how the approach fares on a more diverse range of tasks and domains.

Additionally, the paper does not delve deeply into the potential biases or limitations introduced by the question translation training process. It would be important to further explore how this training may amplify or introduce biases, and how to mitigate such issues.

Another area for further research is how far the approach transfers beyond the models evaluated. The abstract reports gains across model families and sizes, but testing additional architectures would strengthen the broader applicability of the findings.

Overall, this paper makes a valuable contribution to the field of multilingual language understanding and reasoning, and the authors' thoughtful analysis and experimental design set a strong foundation for future work in this area.

Conclusion

This paper studies "question translation training", a recently proposed approach that improves the multilingual reasoning capabilities of large language models. By training models to translate questions from other languages into English, the researchers demonstrate that this technique extends to reasoning with executable code, common-sense reasoning, and very large models.

The findings suggest that question translation training is a powerful tool for enhancing cross-lingual transfer and alignment, allowing language models to better comprehend the meaning and intent behind questions across multiple languages. This has important implications for a wide range of applications, from multilingual chatbots to cross-lingual information retrieval.

The paper's thorough experimental evaluation and insightful analysis provide a solid foundation for further research in this area. Exploring the generalizability of this approach, addressing potential biases, and expanding the scope of evaluation tasks are all important next steps to build on the contributions of this work.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Question Translation Training for Better Multilingual Reasoning

Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, Alexandra Birch

Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions. A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training. This approach not only incurs high cost, but also results in poorly translated data due to the non-standard formatting of mathematical chain-of-thought. In this paper, we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data. In this way we perform targeted, in-domain language alignment which makes best use of English instruction data to unlock the LLMs' multilingual reasoning abilities. Experimental results on LLaMA2-13B show that question alignment leads to consistent improvements over the translate-training approach: an average improvement of 11.3% and 16.1% accuracy across ten languages on the MGSM and MSVAMP multilingual reasoning benchmarks. The project will be available at: https://github.com/NJUNLP/QAlign.

7/2/2024

cs.CL

Eliciting Better Multilingual Structured Reasoning from LLMs through Code

Bryan Li, Tamer Alkhouli, Daniele Bonadiman, Nikolaos Pappas, Saab Mansour

The development of large language models (LLM) has shown progress on reasoning, though studies have largely considered either English or simple reasoning tasks. To address this, we introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We then propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners. First, at training time, we augment a code dataset with multilingual comments using machine translation while keeping program code as-is. Second, at inference time, we bridge the gap between training and inference by employing a prompt structure that incorporates step-by-step code primitives to derive new facts and find a solution. Our methods show improved multilingual performance on xSTREET, most notably on the scientific commonsense reasoning subtask. Furthermore, the models show no regression on non-reasoning tasks, thus demonstrating our techniques maintain general-purpose abilities.

6/13/2024

cs.CL cs.AI

Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

Chong Li, Shaonan Wang, Jiajun Zhang, Chengqing Zong

Multilingual generative models obtain remarkable cross-lingual in-context learning capabilities through pre-training on large-scale corpora. However, they still exhibit a performance bias toward high-resource languages and learn isolated distributions of multilingual sentence representations, which may hinder knowledge transfer across languages. To bridge this gap, we propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences. It aligns the internal sentence representations across different languages via multilingual contrastive learning and aligns outputs by following cross-lingual instructions in the target language. Experimental results show that even with less than 0.1‰ of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models and mitigates the performance gap. Further analyses reveal that it results in a better internal multilingual representation distribution of multilingual models.

6/13/2024

cs.CL

On the Calibration of Multilingual Question Answering LLMs

Yahan Yang, Soham Dan, Dan Roth, Insup Lee

Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety of QA tasks. We perform extensive experiments, spanning encoder-only, encoder-decoder, and decoder-only QA models (size varying from 110M to 7B parameters) and diverse languages, including both high- and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. For decoder-only LLMs such as LlaMa2, we additionally find that in-context learning improves confidence calibration on multilingual data. We also conduct several ablation experiments to study the effect of language distances, language corpus size, and model size on calibration, and how multilingual models compare with their monolingual counterparts for diverse tasks and languages. Our experiments suggest that the multilingual QA models are poorly calibrated for languages other than English and incorporating a small set of cheaply translated multilingual samples during fine-tuning/calibration effectively enhances the calibration performance.

4/16/2024

cs.CL cs.LG