Steering Llama 2 via Contrastive Activation Addition

Read original: arXiv:2312.06681 - Published 7/8/2024 by Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

Overview

The paper introduces Contrastive Activation Addition (CAA), a method for steering language models by modifying their activations during forward passes
CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior
These steering vectors are added at all token positions after the user's prompt, with a positive or negative coefficient, allowing precise control over the degree of the targeted behavior
The authors evaluate CAA on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks
They show that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities
They use activation space interpretation methods to gain insight into CAA's mechanisms

Plain English Explanation

Contrastive Activation Addition (CAA) is a new way to steer the behavior of large language models, like Llama 2 Chat, without having to retrain or fine-tune the model. The key idea is to compute "steering vectors" that capture the average difference in model activations between paired examples of a behavior and its opposite (e.g., factual responses versus hallucinatory responses).

During inference, these steering vectors are added to the model's activations at every token position after the user's prompt, scaled by a coefficient that lets the user control the degree of the targeted behavior. For example, with a steering vector that points from hallucinatory toward factual responses, a positive coefficient nudges the model towards more factual responses, while a negative coefficient pushes it the other way.

The researchers found that CAA is effective at altering model behavior, and that it works on top of other techniques such as fine-tuning and system prompt design. Importantly, CAA achieved these behavioral changes with minimal reduction in the model's overall capabilities.

The researchers also used various activation space interpretation methods to gain deeper insights into how CAA works and how high-level concepts are represented in large language models. This sheds light on the inner workings of these powerful AI systems.

Technical Explanation

Contrastive Activation Addition (CAA) is a novel method for steering the behavior of large language models during inference. Unlike fine-tuning or prompt engineering, CAA directly modifies the model's internal activations to achieve the desired behavior.

The key steps of CAA are:

Compute steering vectors: The researchers first assemble pairs of examples that represent the two sides of a behavior (e.g., factual vs. hallucinatory responses). They then average the difference in residual stream activations between the positive and negative example of each pair, resulting in a "steering vector" that captures the activation pattern associated with the target behavior.
Apply steering vectors: During inference, the computed steering vector is added to the model's residual stream activations at all token positions after the user's prompt. The vector is scaled by a coefficient whose sign sets the direction of the change (towards or away from the behavior) and whose magnitude controls its degree.
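The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration on toy arrays, not the authors' implementation: the function names, array shapes, and the `prompt_len` and `coeff` parameters are assumptions made for the example, and a real setup would read and write activations inside the model's forward pass at a chosen layer.

```python
import numpy as np

def compute_steering_vector(pos_acts, neg_acts):
    """Average the per-pair activation differences.

    pos_acts, neg_acts: arrays of shape (n_pairs, d_model), where row i holds
    the residual stream activation for the positive / negative example of pair i.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    if pos_acts.shape != neg_acts.shape:
        raise ValueError("positive and negative activations must be paired")
    return (pos_acts - neg_acts).mean(axis=0)

def apply_steering(resid, steering_vector, prompt_len, coeff=1.0):
    """Add coeff * steering_vector at every position after the prompt.

    resid: array of shape (seq_len, d_model); positions [0, prompt_len) are
    the user's prompt and are left unchanged.
    """
    steered = np.array(resid, dtype=float, copy=True)
    steered[prompt_len:] += coeff * steering_vector
    return steered

# Toy data (hypothetical): each pair differs only along one hidden direction.
rng = np.random.default_rng(0)
d_model = 8
behavior_direction = rng.normal(size=d_model)
base = rng.normal(size=(16, d_model))
pos_acts = base + behavior_direction
neg_acts = base - behavior_direction

vec = compute_steering_vector(pos_acts, neg_acts)

# Steer away from the behavior (negative coefficient) after a 6-token prompt.
resid = rng.normal(size=(10, d_model))
steered = apply_steering(resid, vec, prompt_len=6, coeff=-1.5)
```

In this toy setup the recovered vector is twice the hidden direction, because each positive and negative example sits one unit on either side of the shared base activation. Flipping the sign of `coeff` flips the direction of the intervention.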

The researchers evaluated CAA's effectiveness on the Llama 2 Chat model using multiple-choice behavioral question datasets and open-ended generation tasks. They found that CAA significantly altered the model's behavior, remained effective when applied on top of traditional techniques like fine-tuning and system prompt design, and only minimally reduced the model's capabilities.
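One way to score a multiple-choice behavioral question is to look at how much probability the model puts on the answer option that matches the behavior, and to compare that score across steering coefficients. The sketch below assumes a two-option (A/B) format and uses made-up logits purely to exercise the function; the summary above does not specify the exact metric, so treat the function names and the scoring rule as hypothetical.

```python
import numpy as np

def matching_answer_prob(logit_a, logit_b, matching):
    """Probability of the behavior-matching option, normalized over A and B."""
    logits = np.array([logit_a, logit_b], dtype=float)
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return float(probs[0] if matching == "A" else probs[1])

def behavior_score(items):
    """Mean matching-answer probability over (logit_a, logit_b, matching) items."""
    return float(np.mean([matching_answer_prob(a, b, m) for a, b, m in items]))

# Made-up logits for two questions under three steering coefficients.
# These are NOT results from the paper; they only illustrate the bookkeeping.
logits_by_coeff = {
    -1.0: [(0.2, 1.4, "A"), (1.1, 0.3, "B")],
    0.0: [(0.9, 0.9, "A"), (0.5, 0.5, "B")],
    1.0: [(1.6, 0.1, "A"), (0.2, 1.5, "B")],
}
scores = {c: behavior_score(items) for c, items in logits_by_coeff.items()}
```

A score that rises as the coefficient grows and falls as it turns negative is the pattern one would look for when checking that a steering vector controls the degree of a behavior.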

To gain deeper insights into how CAA works, the researchers employed various activation space interpretation methods. These analyses shed light on how high-level concepts are represented in large language models and the mechanisms by which CAA is able to steer the model's outputs.
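The summary does not name the specific interpretation methods. As one simple illustration of the general idea, the sketch below computes the cosine similarity between each token's activation and a steering vector, which indicates how strongly each token position aligns with the steered direction. The function name and the toy vectors are assumptions for this example, not the paper's analysis code.

```python
import numpy as np

def cosine_to_vector(token_acts, steering_vector, eps=1e-12):
    """Cosine similarity between each row of token_acts and steering_vector.

    token_acts: array of shape (seq_len, d_model).
    steering_vector: array of shape (d_model,).
    """
    token_acts = np.asarray(token_acts, dtype=float)
    v = np.asarray(steering_vector, dtype=float)
    norms = np.linalg.norm(token_acts, axis=1) * np.linalg.norm(v)
    return token_acts @ v / np.maximum(norms, eps)  # eps guards zero vectors

# Toy activations (hypothetical) in a 3-dimensional residual stream.
steering_vector = np.array([1.0, 0.0, 0.0])
token_acts = np.array([
    [2.0, 0.0, 0.0],   # aligned with the steering direction
    [0.0, 3.0, 0.0],   # orthogonal to it
    [-1.0, 0.0, 0.0],  # opposed to it
    [1.0, 1.0, 0.0],   # partially aligned
])
sims = cosine_to_vector(token_acts, steering_vector)
```

Values near 1 mark positions whose activations point along the steering direction, values near -1 mark positions pointing against it, and values near 0 mark positions unrelated to it.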

Critical Analysis

The researchers provide a thorough evaluation of CAA's effectiveness and discuss several important limitations and areas for further research:

The paper focuses on a limited set of behavioral tasks and datasets, so it's unclear how well CAA would generalize to a wider range of applications.
The researchers acknowledge that CAA may not be able to achieve the same level of behavioral control as fine-tuning, as it operates at a more superficial level by modifying activations rather than retraining the entire model.
The paper does not explore the long-term stability of CAA's effects or whether the steering vectors would need to be periodically updated as the model's underlying knowledge evolves.
The activation space interpretation methods used provide valuable insights, but the researchers note that there are still many open questions about how large language models represent and reason about high-level concepts.

Overall, the paper presents a promising new approach for steering language model behavior, but further research is needed to fully understand the capabilities and limitations of CAA.

Conclusion

Contrastive Activation Addition (CAA) is a method for controlling the behavior of large language models during inference. By computing and applying "steering vectors" that capture the average difference in activations between positive and negative examples, CAA can significantly alter model outputs without retraining the model, and it remains effective when combined with fine-tuning or system prompt design.

The researchers' evaluation demonstrates the effectiveness of CAA, and their activation space interpretation methods provide valuable insights into how these powerful AI systems represent and reason about high-level concepts. While further research is needed to fully understand the capabilities and limitations of CAA, this work represents an important step forward in the field of language model steering and control.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

7/8/2024

Activation Addition: Steering Language Models Without Optimization

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking activation differences resulting from pairs of prompts. We demonstrate ActAdd on a range of LLMs (LLaMA-3, OPT, GPT-2, and GPT-J), obtaining SOTA on detoxification and negative-to-positive sentiment control. Our approach yields inference-time control over high-level properties of output like topic and sentiment while preserving performance on off-target tasks. ActAdd takes far less compute and implementation effort than finetuning or RLHF, allows users control through natural language, and its computational overhead (as a fraction of inference time) appears stable or improving over increasing model size.

6/5/2024

Programming Refusal with Conditional Activation Steering

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar

LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model's hidden states. Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse." This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of our framework.

9/11/2024

Representation Tuning

Christopher M. Ackerman

Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, I identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss (representation tuning). Finally, I compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning; tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.

9/12/2024