Activation Addition: Steering Language Models Without Optimization

Read original: arXiv:2308.10248 - Published 6/5/2024 by Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Overview

Controlling the behavior of large language models is a critical challenge
Existing methods include supervised finetuning, reinforcement learning, and prompt engineering
This paper explores a new approach called "activation engineering" to predictably alter model behavior

Plain English Explanation

This paper investigates a new way to control the behavior of large language models, AI systems that generate human-like text. Existing control methods, such as fine-tuning a model on specific tasks or applying reinforcement learning, can be time-consuming and expensive.

The researchers' approach, called "activation engineering," is different. Instead of retraining the model, they modify its internal activations (the patterns of neuron activity) at inference time. By adding a "steering vector" to the activations, they can nudge the model's outputs in a desired direction, such as making the text less toxic or more positive in sentiment.

What is novel about their method, Activation Addition (ActAdd), is that it computes these steering vectors directly, by taking the difference in activations between a pair of prompts. Users therefore specify the behavior they want in natural language, by choosing the prompt pair, and need no fine-tuning or reinforcement learning.

The researchers show that ActAdd works on a range of large language models, where it achieves state-of-the-art results on detoxification and negative-to-positive sentiment control. Importantly, it does this without hurting the model's performance on unrelated tasks. Its computational overhead, measured as a fraction of inference time, also appears to stay stable or shrink as models get larger.

Technical Explanation

The paper introduces a new method called "Activation Addition" (ActAdd) for controlling the behavior of large language models (LLMs) during inference. Rather than relying on costly fine-tuning or reinforcement learning from human feedback, ActAdd modifies the internal activations of the model to steer its outputs in a desired direction.

The key insight is that an LLM's activations encode high-level properties of the text it is generating. By adding a "steering vector" to these activations during the forward pass, the researchers can predictably alter properties of the model's outputs, such as their topic, sentiment, or toxicity level.

Unlike prior work that learned these steering vectors, ActAdd computes them automatically by taking the differences in activations between pairs of prompts. This allows users to control the model's behavior simply by providing the right input prompts, without needing to retrain or fine-tune the model.
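
The difference-and-add step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: activations are plain lists standing in for one layer's per-token activations, the function names steering_vector and add_steering are hypothetical, and the decisions to scale the difference by a coefficient, to require both prompts to span the same number of token positions, and to add the vector starting at the first position are assumptions about the setup.

```python
from typing import List

Vector = List[float]
Activations = List[Vector]  # one activation vector per token position


def steering_vector(act_plus: Activations, act_minus: Activations,
                    coeff: float) -> Activations:
    """Scaled per-position difference between two prompts' activations.

    Assumption: both prompts cover the same number of token positions.
    """
    if len(act_plus) != len(act_minus):
        raise ValueError("prompt activations must have equal length")
    return [[coeff * (p - m) for p, m in zip(vp, vm)]
            for vp, vm in zip(act_plus, act_minus)]


def add_steering(resid: Activations, steer: Activations,
                 start: int = 0) -> Activations:
    """Return a copy of resid with steer added from position start onward."""
    out = [list(v) for v in resid]  # copy so the input is left untouched
    for i, sv in enumerate(steer):
        pos = start + i
        if pos >= len(out):
            break  # steering vector is longer than the remaining input
        out[pos] = [x + s for x, s in zip(out[pos], sv)]
    return out


# Toy numbers, purely illustrative (not taken from the paper).
plus = [[1.0, 0.0], [0.5, 2.0]]    # activations for the "wanted" prompt
minus = [[0.0, 1.0], [0.5, 0.0]]   # activations for the "unwanted" prompt
steer = steering_vector(plus, minus, coeff=2.0)
steered = add_steering([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], steer)
```

In a real model the two activation lists would come from forward passes on the two prompts at a chosen layer, and the addition would happen inside the forward pass on the user's input; no weights are changed at any point.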

The researchers evaluate ActAdd on a range of LLMs, including LLaMA-3, OPT, GPT-2, and GPT-J. They demonstrate state-of-the-art performance on tasks like detoxification and sentiment control, while preserving the model's performance on unrelated tasks.

Importantly, ActAdd is computationally efficient: its overhead, measured as a fraction of inference time, appears to remain stable or even improve as model size increases. This contrasts with fine-tuning or reinforcement learning approaches, which become increasingly costly as models grow larger.

Critical Analysis

The paper presents a compelling approach to controlling the behavior of large language models, but there are a few caveats to consider:

Interpretability: While the paper demonstrates the effectiveness of ActAdd, it doesn't fully explain how the activation modifications translate to the desired changes in output. More research is needed to understand the underlying mechanisms and ensure the changes are interpretable and aligned with user intent.
Prompt Dependence: The performance of ActAdd relies heavily on the choice of prompts used to compute the steering vectors. It's unclear how robust the method is to variations in prompts or how to best select prompts for a given task or desired outcome.
Generalization: The paper focuses on a limited set of tasks, like detoxification and sentiment control. Further research is needed to understand how well ActAdd can generalize to a broader range of control tasks and model types.
Ethical Considerations: The ability to precisely control the behavior of large language models raises important ethical questions. Researchers should carefully consider the potential for misuse and work to ensure these techniques are developed responsibly.

Despite these caveats, the paper introduces a promising new approach to controlling large language models that could have significant implications for AI safety and robustness.

Conclusion

This paper presents a novel method called "Activation Addition" (ActAdd) for controlling the behavior of large language models during inference. By modifying the internal activations of the model, ActAdd can steer the output in a desired direction, such as making it less toxic or more positive in sentiment.

The key advantages of ActAdd are its computational efficiency, the ability to control the model through natural language prompts, and its preservation of performance on unrelated tasks. These features make it a promising alternative to existing fine-tuning and reinforcement learning approaches, which can be costly and time-consuming.

While the paper demonstrates the effectiveness of ActAdd on a range of tasks and models, further research is needed to address questions of interpretability, prompt dependence, and generalization. Nonetheless, this work represents an important step forward in the quest to reliably control the behavior of large language models, with potential applications in personalized steering of LLMs and other areas of AI safety and robustness.

Related Papers

Activation Addition: Steering Language Models Without Optimization

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking activation differences resulting from pairs of prompts. We demonstrate ActAdd on a range of LLMs (LLaMA-3, OPT, GPT-2, and GPT-J), obtaining SOTA on detoxification and negative-to-positive sentiment control. Our approach yields inference-time control over high-level properties of output like topic and sentiment while preserving performance on off-target tasks. ActAdd takes far less compute and implementation effort than finetuning or RLHF, allows users control through natural language, and its computational overhead (as a fraction of inference time) appears stable or improving over increasing model size.

6/5/2024

Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

7/8/2024
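
The averaging and post-prompt addition described in the CAA abstract above can be sketched as follows. This is a rough illustration, not the CAA authors' implementation: each example is assumed to be summarized by a single activation vector, and the names mean_difference and steer_after_prompt are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference(pos_acts: Sequence[Vector],
                    neg_acts: Sequence[Vector]) -> Vector:
    """Average (positive - negative) activation over paired examples."""
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need equally many positive and negative examples")
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for p, n in zip(pos_acts, neg_acts):
        for j in range(dim):
            total[j] += p[j] - n[j]
    return [t / len(pos_acts) for t in total]


def steer_after_prompt(resid: Sequence[Vector], vec: Vector,
                       prompt_len: int, coeff: float) -> List[Vector]:
    """Add coeff * vec at every position after the first prompt_len tokens."""
    return [list(h) if i < prompt_len
            else [x + coeff * v for x, v in zip(h, vec)]
            for i, h in enumerate(resid)]
```

A positive coefficient pushes generations toward the behavior in the positive examples, and a negative one pushes away from it; positions inside the user's prompt are left unchanged.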

Representation Tuning

Christopher M. Ackerman

Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, I identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss (representation tuning). Finally, I compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning; tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.

9/12/2024
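
One way to picture the dual loss mentioned in the abstract above is the sketch below. The exact form of the loss is an assumption made for illustration: the weighting term, the (1 - cosine) shape, and the name dual_loss are all hypothetical, and a real implementation would compute this inside an autodiff framework so that gradients reach the model weights.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dual_loss(token_loss: float, activation: Sequence[float],
              direction: Sequence[float], weight: float = 1.0,
              toward: bool = True) -> float:
    """Token loss plus a penalty on misalignment with the target direction.

    Assumed form: penalty is (1 - cos) when tuning toward the direction
    and (1 + cos) when tuning away from it.
    """
    cos = cosine_similarity(activation, direction)
    rep_loss = (1.0 - cos) if toward else (1.0 + cos)
    return token_loss + weight * rep_loss
```

With this shape, an activation already aligned with the direction adds no penalty when tuning toward it, so the token-based loss dominates once the representation has moved.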

Programming Refusal with Conditional Activation Steering

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar

LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model's hidden states. Using CAST, one can systematically control LLM behavior with rules like if input is about hate speech or adult content, then refuse or if input is not about legal advice, then refuse. This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of our framework.

9/11/2024
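
The apply-or-withhold rule described in the CAST abstract above can be sketched as a gate in front of ordinary activation steering. This is a simplified illustration, not the released CAST implementation: using cosine similarity against a single condition vector, comparing it to a fixed threshold, and the name conditional_steer are all assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conditional_steer(hidden: Sequence[float], condition_vec: Sequence[float],
                      behavior_vec: Sequence[float], threshold: float,
                      coeff: float = 1.0) -> List[float]:
    """Add the behavior vector only when hidden matches the condition.

    Assumed rule: the condition holds when the cosine similarity between
    the hidden state and condition_vec exceeds threshold.
    """
    if cosine_similarity(hidden, condition_vec) > threshold:
        return [h + coeff * b for h, b in zip(hidden, behavior_vec)]
    return list(hidden)  # condition not met: leave the activation as is
```

Here condition_vec plays the role of a detector for a prompt category (for example, hate speech) and behavior_vec the role of a refusal direction, so inputs that do not match the condition pass through unmodified.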