Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

arXiv:2404.12318

Published 4/19/2024 by Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami

Abstract

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.

Overview

Aligning language models (LMs) based on human-annotated preference data is crucial for practical and performant LM-based systems.
Obtaining multilingual human preference data at scale is challenging, making it difficult to extend this alignment framework to diverse languages.
This paper evaluates a simple approach for zero-shot cross-lingual alignment, where a reward model trained on preference data in one source language is directly applied to other target languages.

Plain English Explanation

Language models (LMs) are AI systems that can generate human-like text. To make these LMs practical and effective, they need to be "aligned" with human preferences - that is, trained to produce text that humans prefer.

However, gathering the necessary human preference data, especially across multiple languages, is very difficult. This paper looks at a simpler approach to cross-lingual alignment: a reward model, which scores how well a piece of text matches human preferences, is trained on preference data in one language and then used directly to align LMs in other languages, without needing new preference data for those languages.

The researchers show that this "zero-shot" cross-lingual alignment approach works surprisingly well. The aligned LMs are often preferred by humans over unaligned LMs, across tasks like summarization and open-ended conversation. Interestingly, they even find that using a reward model trained on a different language can sometimes work better than using one trained on the same language.

This research provides a promising path forward for making LM alignment scalable to many languages, without the need for extensive human preference data in each one. It highlights best practices for doing this kind of cross-lingual transfer when no language-specific data is available even for supervised finetuning.

Technical Explanation

The paper evaluates a zero-shot approach to cross-lingual alignment of language models (LMs) using human preference data. Typically, aligning LMs to human preferences requires collecting preference data, which is difficult to do at scale across many languages.
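Preference data of this kind is usually stored as pairs of responses, one preferred and one rejected, for the same prompt. This summary does not say how such pairs are turned into a training signal; the sketch below assumes the widely used Bradley-Terry pairwise loss, and the function names and toy scores are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the chosen response wins.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), computed in a
    numerically stable way for large positive or negative margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) pairs."""
    if not score_pairs:
        raise ValueError("need at least one preference pair")
    total = sum(pairwise_preference_loss(c, r) for c, r in score_pairs)
    return total / len(score_pairs)


# Toy scores (assumed, not from the paper): the loss is small when the
# chosen response already scores higher, and large when it scores lower.
well_ordered = pairwise_preference_loss(2.0, 0.0)
mis_ordered = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss pushes the reward model to score the preferred response above the rejected one. Nothing in the loss depends on the language of the text, which is what makes it possible to try the trained model on other languages.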

The researchers instead train a "reward model" on preference data in a single source language and use it directly to align LMs in other target languages, without collecting any target-language preference data. On summarization and open-ended dialog generation, they show that this method consistently produces LMs that humans prefer over unaligned models, on more than 70% of evaluation instances in the best cases.
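One simple way to apply a reward model to a target language is best-of-n reranking: sample several candidate outputs in the target language and keep the one the reward model scores highest. This summary does not state which alignment procedure the paper uses, so the sketch below is only an illustration of the idea. `toy_reward` is a hypothetical stand-in (a crude word-overlap score), not a trained reward model, and the German strings are made-up examples.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward for the given prompt."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate on ties.
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical placeholder: fraction of response words found in the prompt.

    A real setup would call a reward model trained on source-language
    preference data here instead.
    """
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    hits = sum(word in prompt_words for word in response_words)
    return hits / len(response_words)


# Target-language (German) input scored by the stand-in reward function.
prompt = "Der Bericht beschreibt steigende Preise in Europa"
candidates = ["Das Wetter ist heute sonnig", "steigende Preise in Europa"]
selected = best_of_n(prompt, candidates, toy_reward)
```

Because the reward function is passed in as a parameter, swapping a same-language reward model for a different-language one changes nothing else in the pipeline.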

Interestingly, they find that using a reward model trained on a different language can sometimes yield better aligned models than using one trained on the same language. The paper also identifies best practices for cross-lingual transfer when no language-specific data is available even for supervised finetuning, another key component of the alignment process.
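Comparisons like these are typically reported as win rates: the share of evaluation instances on which a judge prefers one model's output over another's. The sketch below shows one way to compute such a number. The label names, the choice to count ties as half a win, and the example judgments are all assumptions made for illustration, not details or data from the paper.

```python
from typing import Sequence


def win_rate(judgments: Sequence[str]) -> float:
    """Share of pairwise judgments won by the aligned model.

    Each judgment is "aligned", "baseline", or "tie". Ties count as half
    a win, which is an assumption of this sketch.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    allowed = {"aligned", "baseline", "tie"}
    unknown = set(judgments) - allowed
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    wins = sum(label == "aligned" for label in judgments)
    ties = sum(label == "tie" for label in judgments)
    return (wins + 0.5 * ties) / len(judgments)


# Made-up judgments for illustration only.
example = ["aligned"] * 7 + ["baseline"] * 3
example_rate = win_rate(example)
```

A win rate above 0.5 means the aligned model was preferred more often than the baseline. Computing it separately for each source-language reward model makes same-language and different-language transfer directly comparable.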

Critical Analysis

The paper presents a compelling approach to the challenge of scaling human preference alignment of language models to diverse languages. The zero-shot cross-lingual transfer method is elegant in its simplicity and the empirical results are quite strong.

However, the researchers acknowledge some key limitations. The evaluations are still relatively limited in scope, focused on a few specific tasks. It's unclear how well the findings would generalize to a wider range of applications and languages. There are also open questions about the stability and robustness of the cross-lingual transfer, which the paper does not deeply explore.

Additionally, while the paper identifies best practices for the zero-shot setup, the lack of even supervised finetuning data in some languages is a significant constraint. More research is needed on techniques to enable high-quality cross-lingual alignment with minimal language-specific data.

Overall, this work represents an important step forward, but there is still much to be explored in making human preference alignment scalable across the vast diversity of the world's languages.

Conclusion

This paper presents a promising approach to the challenge of scaling human preference alignment of language models to diverse languages. By training a reward model on preference data in one language and directly applying it to other languages, the researchers demonstrate consistent success in producing cross-lingually aligned models that are preferred by humans.

The findings suggest this zero-shot transfer method could be a valuable tool for making language model alignment more scalable and accessible across many languages, without the need for extensive human preference data in each one. The insights around using models trained on different languages, and best practices for dealing with a lack of even supervised finetuning data, provide a strong foundation for future research in this area.

As language models become increasingly influential, aligning them with human values and preferences is crucial. This work improves our ability to do so across the full diversity of the world's languages, including those for which little or no preference data exists.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

Chong Li, Shaonan Wang, Jiajun Zhang, Chengqing Zong

Multilingual generative models obtain remarkable cross-lingual in-context learning capabilities through pre-training on large-scale corpora. However, they still exhibit a performance bias toward high-resource languages and learn isolated distributions of multilingual sentence representations, which may hinder knowledge transfer across languages. To bridge this gap, we propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences. It aligns the internal sentence representations across different languages via multilingual contrastive learning and aligns outputs by following cross-lingual instructions in the target language. Experimental results show that even with less than 0.1‰ of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models and mitigates the performance gap. Further analyses reveal that it results in a better internal multilingual representation distribution of multilingual models.

6/13/2024

cs.CL

Probing the Emergence of Cross-lingual Alignment during LLM Training

Hetong Wang, Pasquale Minervini, Edoardo M. Ponti

Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance. We speculate that this is predicated on their ability to align languages without explicit supervision from parallel sentences. While representations of translationally equivalent sentences in different languages are known to be similar after convergence, it remains unclear how such cross-lingual alignment emerges during pre-training of LLMs. Our study leverages intrinsic probing techniques, which identify which subsets of neurons encode linguistic features, to correlate the degree of cross-lingual neuron overlap with the zero-shot cross-lingual transfer performance for a given model. In particular, we rely on checkpoints of BLOOM, a multilingual autoregressive LLM, across different training steps and model scales. We observe a high correlation between neuron overlap and downstream performance, which supports our hypothesis on the conditions leading to effective cross-lingual transfer. Interestingly, we also detect a degradation of both implicit alignment and multilingual abilities in certain phases of the pre-training process, providing new insights into the multilingual pretraining dynamics.

6/21/2024

cs.CL cs.AI cs.LG

ReMoDetect: Reward Models Recognize Aligned LLM's Generations

Hyunseok Lee, Jihoon Tack, Jinwoo Shin

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that because these aligned LLMs are trained to maximize human preferences, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected by using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model, namely (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even further and (ii) reward modeling of Human/LLM mixed texts (texts rephrased from human-written texts using aligned LLMs), which serve as a median preference text corpus between LGTs and human-written texts to learn the decision boundary better. We provide an extensive evaluation by considering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/reward_llm_detect.

5/28/2024

cs.LG cs.CL

Aligning language models with human preferences

Tomasz Korbak

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching but distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5 I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.

4/19/2024

cs.LG cs.CL