Word Embeddings Are Steers for Language Models

Read original: arXiv:2305.12798 - Published 6/7/2024 by Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji

💬

Overview

Language models (LMs) automatically learn word embeddings during pre-training on language corpora.
Word embeddings are usually interpreted as feature vectors for individual words, but their roles in language model generation remain underexplored.
This work theoretically and empirically revisits output word embeddings and finds that their linear transformations are equivalent to steering language model generation styles.
The authors name such steers "LM-Steers" and find them existing in LMs of all sizes, requiring only 0.2% of the original LM's size to learn.

Plain English Explanation

Language models are AI systems that can generate human-like text by learning patterns from large datasets of text. During this training process, the language model also learns word embeddings - numerical representations of words that capture their meaning and relationships.

Typically, these word embeddings are seen as just feature vectors for individual words, but the paper explores how they can also be used to steer the style and tone of the text generated by the language model.

The researchers call these steering mechanisms "LM-Steers" and find that they exist in language models of all sizes. Interestingly, it only requires a small fraction (0.2%) of the original language model's size to learn these steering parameters.

The LM-Steers allow the language model to be guided towards different writing styles, such as making the text more positive or less toxic. This can be useful for tasks like detoxifying text or controlling the sentiment of the generated text.

Moreover, the LM-Steers can reveal insights about how the language model's internal word embeddings are associated with the generated text styles. They can even be transferred between different language models, allowing for easy customization and style control.

Technical Explanation

The paper theoretically and empirically explores the role of output word embeddings in language model generation. The researchers find that the linear transformations of these output word embeddings are equivalent to "steering" the language model's generation style. They refer to these transformations as "LM-Steers".

Through experimentation, the authors demonstrate that LM-Steers exist in language models of all sizes, and only require learning parameters equal to 0.2% of the original LM's size to steer each generation style. This allows for efficient and effective control over the language model's output.

On tasks such as language model detoxification and sentiment control, the LM-Steers can achieve comparable or superior performance compared to state-of-the-art controlled generation methods, while maintaining a better balance with generation quality.

The paper also presents the LM-Steer as a lens into the interpretability of word embeddings when associated with language model generations. The LM-Steer can highlight text spans that most indicate the style differences, providing insights into the internal representations of the language model.

Additionally, the authors demonstrate that the LM-Steer is transferrable between different language models through an explicit form calculation. This allows for continuous steering of language models by scaling the LM-Steer or composing multiple LM-Steers.

Critical Analysis

The paper presents a compelling approach to steering language model generation through the use of LM-Steers, which leverage the linguistic information captured in the language model's output word embeddings. The authors provide a thorough theoretical and empirical analysis, demonstrating the effectiveness of this method on tasks like text detoxification and sentiment control.

One potential limitation, however, is the reliance on the underlying linguistic representations of the language model. While the paper shows that the LM-Steers can reveal insights about these representations, it remains to be seen how well the approach would generalize to language models with different architectures or training processes, which may encode linguistic information differently.

Additionally, the paper does not explore the potential biases or limitations of the LM-Steers themselves. As with any text generation system, there may be inherent biases or blindspots in the way the LM-Steers steer the language model's output, which could be an area for further investigation.

Overall, the research presented in this paper is a valuable contribution to the understanding and control of language model generation, offering a novel approach to steering text styles and generating insights about the language model's internal representations. As the field of large language models continues to evolve, studies like this will be crucial for developing more interpretable and controllable text generation systems.

Conclusion

This paper introduces a novel concept called "LM-Steers" - linear transformations of a language model's output word embeddings that can be used to steer the style and tone of the generated text. The researchers find that LM-Steers exist in language models of all sizes and only require a small fraction of the original model's parameters to learn.

The LM-Steers have proven effective in tasks like language model detoxification and sentiment control, outperforming state-of-the-art methods while maintaining generation quality. Importantly, the LM-Steers also serve as a lens into the interpretability of word embeddings and their relationship to text styles.

This research opens up new possibilities for customizing and controlling the output of large language models, with applications ranging from personalized content generation to safer and more ethical text production. As the field of AI-powered text generation continues to evolve, studies like this will be crucial for developing more versatile and interpretable language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Word Embeddings Are Steers for Language Models

Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji

Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles. We name such steers LM-Steers and find them existing in LMs of all sizes. It requires learning parameters equal to 0.2% of the original LMs' size for steering each style. On tasks such as language model detoxification and sentiment control, LM-Steers can achieve comparable or superior performance compared with state-of-the-art controlled generation methods while maintaining a better balance with generation quality. The learned LM-Steer serves as a lens in text styles: it reveals that word embeddings are interpretable when associated with language model generations and can highlight text spans that most indicate the style differences. An LM-Steer is transferrable between different language models by an explicit form calculation. One can also continuously steer LMs simply by scaling the LM-Steer or compose multiple LM-Steers by adding their transformations. Our codes are publicly available at url{https://github.com/Glaciohound/LM-Steer}.

6/7/2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen

Researchers have been studying approaches to steer the behavior of Large Language Models (LLMs) and build personalized LLMs tailored for various applications. While fine-tuning seems to be a direct solution, it requires substantial computational resources and may significantly affect the utility of the original LLM. Recent endeavors have introduced more lightweight strategies, focusing on extracting steering vectors to guide the model's output toward desired behaviors by adjusting activations within specific layers of the LLM's transformer architecture. However, such steering vectors are directly extracted from the activations of human preference data and thus often lead to suboptimal results and occasional failures, especially in alignment-related scenarios. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs, thereby offering a more precise representation of the target behavior. By carefully adjusting the direction and magnitude of the steering vector, we enabled personalized control over the desired behavior across a spectrum of intensities. Extensive experimentation across various open-ended generation tasks, particularly focusing on steering AI personas, has validated the efficacy of our approach. Moreover, we comprehensively investigate critical alignment-concerning scenarios, such as managing truthfulness, mitigating hallucination, and addressing jailbreaking attacks. Remarkably, our method can still demonstrate outstanding steering effectiveness across these scenarios. Furthermore, we showcase the transferability of our steering vectors across different models/LoRAs and highlight the synergistic benefits of applying multiple vectors simultaneously.

7/31/2024

Representation Surgery: Theory and Practice of Affine Steering

Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.

6/6/2024

Analyzing the Generalization and Reliability of Steering Vectors -- ICML 2024

Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk

Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain many technical difficulties of applying steering vectors to guide models' behaviour at scale.

7/23/2024