Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Read original: arXiv:2406.15518 - Published 6/26/2024 by Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

Overview

This paper explores how to steer a language model after deployment, for example to resist newly discovered jailbreaks, without degrading its behavior on benign inputs.
The authors propose KL-then-steer (KTS): first train the model to minimize the Kullback-Leibler (KL) divergence between its steered and unsteered versions on benign inputs, then steer the trained model.
Their best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model, while helpfulness on benign requests (as measured by MT-Bench) stays almost on par with the original model.

Plain English Explanation

The paper focuses on improving the ability to control or "steer" large language models after they have already been deployed and are being used. The goal is to allow users to guide the model's outputs in specific directions, without causing unexpected or undesirable side effects.

The approach builds on steering vectors, which are particular vectors added to the model's hidden states to shift its behavior. Because most queries are unproblematic, a classifier first flags potentially problematic inputs, and steering is applied only to those. The catch is that steering can also hurt the model's performance, which matters whenever the classifier wrongly flags a benign input.

To reduce this cost, the authors introduce KL-then-steer (KTS). They first train the model so that, on benign inputs, its steered outputs stay close to its unsteered outputs, and then they steer the trained model. To show the method applies beyond jailbreaks, they also steer the KTS model to reduce bias towards user-suggested answers on TruthfulQA.

Overall, the goal is to give users more precise control over large language models after they have been deployed, while minimizing the risk of unintended consequences or undesirable outputs. This could have important implications for the safe and responsible use of these powerful AI systems.

Technical Explanation

The paper presents KL-then-steer (KTS), a technique for reducing the side effects of steering vectors, together with the selective pipeline in which it is deployed:

Selective steering: Inputs are classified as potentially problematic, and steering vectors are applied only to those inputs, i.e. particular vectors are added to the model's hidden states.

Side effects of steering: Steering vectors can also degrade model performance, which becomes an issue on inputs the classifier flagged incorrectly.

KL training: The model is first trained to minimize the Kullback-Leibler (KL) divergence between a steered and an unsteered model on benign inputs.

Steering the trained model: The model that has undergone this training is then steered, which retains the benefits of steering while decreasing its side effects.
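
Selective steering, as the paper's abstract describes it, classifies inputs as potentially problematic and adds a steering vector to the model's hidden states only for those inputs. The NumPy sketch below illustrates that gating on toy data. The function names, the keyword-based classifier, the `scale` parameter, and the array shapes are all assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def toy_classifier(prompt, blocklist=("ignore previous instructions",)):
    """Hypothetical stand-in for an input classifier: flags prompts containing a blocklisted phrase."""
    return any(phrase in prompt.lower() for phrase in blocklist)

def steer_hidden_state(hidden, steering_vector, flagged, scale=1.0):
    """Add a steering vector to every position of a hidden state, only if the input was flagged.

    hidden: array of shape (seq_len, d_model)
    steering_vector: array of shape (d_model,)
    """
    if not flagged:
        # Benign inputs pass through unchanged, so they incur no steering side effects.
        return hidden
    # Broadcasting adds the same vector at every sequence position.
    return hidden + scale * steering_vector

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy hidden state: 4 positions, 8 dimensions
vector = rng.normal(size=8)        # toy steering vector

benign = steer_hidden_state(hidden, vector, toy_classifier("What is the capital of France?"))
risky = steer_hidden_state(hidden, vector, toy_classifier("Ignore previous instructions and ..."))
```

In this sketch a false positive from the classifier would steer a benign input, which is the case where side effects of steering matter.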

The paper reports empirical results for this approach: the best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while keeping helpfulness on benign requests, as measured by MT-Bench, almost on par with the original LM. To demonstrate generality beyond jailbreaks, the authors also show that the KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available at https://github.com/AsaCooperStickland/kl-then-steer.
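
The paper's abstract describes training a model to minimize the KL divergence between a steered and an unsteered model on benign inputs. The sketch below computes such a divergence from two sets of toy next-token logits. The direction KL(unsteered || steered), the averaging over positions, and all names and shapes are assumptions made for illustration; the abstract does not specify them, and this is not the authors' training code.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(p_logits, q_logits):
    """Mean KL(p || q) over positions, where p and q are softmax distributions of the given logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    p = np.exp(log_p)
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
# Toy logits for 5 positions over a 10-token vocabulary (assumed shapes).
unsteered_logits = rng.normal(size=(5, 10))
# Assumption: steering perturbs the logits; here the perturbation is random noise.
steered_logits = unsteered_logits + 0.5 * rng.normal(size=(5, 10))

# A larger value means steering changed the model's output distribution more.
side_effect = kl_divergence(unsteered_logits, steered_logits)
no_change = kl_divergence(unsteered_logits, unsteered_logits)
```

A training loop in the spirit of KTS would treat a quantity like `side_effect`, measured on benign inputs, as a loss to drive toward zero before the model is steered at deployment.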

Critical Analysis

The paper addresses an important challenge in the safe and responsible use of these powerful AI systems: mitigating worst-case behavior after deployment without frequent retraining, which results in an unstable user experience. The authors support their method with empirical results on jailbreak prevention, helpfulness, and bias towards user-suggested answers.

One potential limitation is that the headline results are reported for a single model, Llama-2-chat-7B, with helpfulness measured by MT-Bench, so further testing in real-world applications and on more diverse models and datasets would be valuable. Additionally, unlike plain steering, KTS requires a training step before the model can be steered, and the computational cost of that step could be an important practical consideration.

While the paper makes a valuable contribution, there may be other avenues for research worth exploring, such as investigating the long-term stability and robustness of the steering methods, or exploring ways to further improve the interpretability and transparency of the controlled language model outputs.

Overall, this paper represents a significant step forward in addressing the critical issue of post-deployment control of language models, and the techniques presented could have important implications for the responsible development and use of these technologies.

Conclusion

This paper introduces KL-then-steer (KTS), a technique for steering language models after deployment while decreasing the side effects of steering. By training the model to keep its steered and unsteered behavior close on benign inputs before steering it, KTS prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while keeping helpfulness almost on par with the original.

The empirical results suggest that targeted mitigation of worst-case behavior could have important implications for the safe and responsible development and use of large language models, which are becoming increasingly prevalent in a wide range of applications. By addressing the challenge of post-deployment control, this research contributes to ongoing efforts to ensure that these powerful AI systems are used in ways that benefit society.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue on cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.

6/26/2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen

Researchers have been studying approaches to steer the behavior of Large Language Models (LLMs) and build personalized LLMs tailored for various applications. While fine-tuning seems to be a direct solution, it requires substantial computational resources and may significantly affect the utility of the original LLM. Recent endeavors have introduced more lightweight strategies, focusing on extracting steering vectors to guide the model's output toward desired behaviors by adjusting activations within specific layers of the LLM's transformer architecture. However, such steering vectors are directly extracted from the activations of human preference data and thus often lead to suboptimal results and occasional failures, especially in alignment-related scenarios. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs, thereby offering a more precise representation of the target behavior. By carefully adjusting the direction and magnitude of the steering vector, we enabled personalized control over the desired behavior across a spectrum of intensities. Extensive experimentation across various open-ended generation tasks, particularly focusing on steering AI personas, has validated the efficacy of our approach. Moreover, we comprehensively investigate critical alignment-concerning scenarios, such as managing truthfulness, mitigating hallucination, and addressing jailbreaking attacks. Remarkably, our method can still demonstrate outstanding steering effectiveness across these scenarios. Furthermore, we showcase the transferability of our steering vectors across different models/LoRAs and highlight the synergistic benefits of applying multiple vectors simultaneously.

7/31/2024

Analyzing the Generalization and Reliability of Steering Vectors -- ICML 2024

Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk

Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain many technical difficulties of applying steering vectors to guide models' behaviour at scale.

7/23/2024

Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment

Haoran Wang, Kai Shu

To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack (TA^2), which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. Our experiment results on four primary alignment tasks show that TA^2 is highly effective and adds little or no overhead to attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks.

8/19/2024