Programming Refusal with Conditional Activation Steering

Read original: arXiv:2409.05907 - Published 9/11/2024 by Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar


Overview

Large language models (LLMs) have impressive capabilities, but controlling their behavior remains challenging.
Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical use in settings requiring selective responses, such as content moderation or domain-specific assistants.
This paper proposes Conditional Activation Steering (CAST), which selectively applies or withholds activation steering based on the input context.

Plain English Explanation

The paper addresses the challenge of precisely controlling the behavior of large language models (LLMs), which have shown remarkable capabilities across many tasks. Existing "activation steering" methods, which change an LLM's behavior by modifying its internal activations during inference, alter the model's responses indiscriminately: once steering is switched on, it affects every input. This limits their practical use in settings where selective responses are essential, such as content moderation or domain-specific assistants.

To address this, the researchers introduce a new approach called Conditional Activation Steering (CAST). CAST analyzes the activation patterns in the LLM's hidden states during inference and uses this information to selectively apply or withhold activation steering based on the input context. In other words, CAST can systematically control the LLM's behavior with rules like "if the input is about hate speech or adult content, then refuse to respond," or "if the input is not about legal advice, then refuse to respond." This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization (a process that can be computationally intensive).

The key insight behind CAST is the observation that different categories of prompts activate distinct patterns in the model's hidden states. By leveraging this, the researchers have developed a way to target specific types of content while leaving the model's behavior unaltered for other types of input. This could be particularly valuable for building more reliable and responsible LLM-powered applications.

Technical Explanation

The paper proposes a novel method called Conditional Activation Steering (CAST) that selectively applies or withholds activation steering based on the input context. Activation steering is a technique that modifies the behavior of large language models (LLMs) by altering their internal activations during inference.
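To make the idea of altering internal activations concrete, here is a minimal sketch of plain (unconditional) activation steering. It assumes the common additive formulation, in which a scaled steering vector is added to a hidden state; the function name `steer`, the `strength` parameter, and the toy three-dimensional vectors are illustrative choices, not taken from the paper's code.

```python
from typing import List

Vector = List[float]


def steer(hidden: Vector, steering_vector: Vector, strength: float) -> Vector:
    """Return hidden + strength * steering_vector.

    This is unconditional steering: the shift is applied to every input,
    which is the indiscriminate behavior CAST is meant to avoid.
    """
    if len(hidden) != len(steering_vector):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + strength * v for h, v in zip(hidden, steering_vector)]


# Toy values (assumed for illustration); real hidden states have thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
refusal_direction = [1.0, 0.0, -1.0]

print(steer(hidden, refusal_direction, strength=2.0))  # [2.5, -1.0, 0.0]
```

In a real model this addition would typically be applied to a layer's output for every token, for example through a forward hook, rather than to a single list of floats.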

CAST rests on the observation that different categories of prompts (e.g., hate speech, legal advice) activate distinct patterns in the LLM's hidden states. By detecting these patterns during inference, CAST can systematically control the model's behavior with rules like "if the input is about hate speech or adult content, then refuse to respond."
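One way to picture how a category of prompt could be detected from hidden states is a similarity check against a per-category direction. The sketch below is an assumption-laden illustration, not the paper's exact procedure: it uses plain cosine similarity against a hypothetical `condition_vector`, and the `threshold` value and two-dimensional toy vectors are made up for the example.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def condition_met(hidden: Vector, condition_vector: Vector, threshold: float) -> bool:
    """True when the hidden state points strongly enough along the condition direction."""
    return cosine_similarity(hidden, condition_vector) > threshold


# Hypothetical direction for a "hate speech" category (illustrative only).
hate_speech_direction = [1.0, 0.0]

print(condition_met([0.9, 0.1], hate_speech_direction, threshold=0.8))  # True
print(condition_met([0.1, 0.9], hate_speech_direction, threshold=0.8))  # False
```

How the condition direction and threshold are obtained is a separate question; this sketch only shows the comparison step.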

This selective activation steering approach allows for targeted modification of responses to specific content while maintaining normal responses to other types of input. Crucially, CAST achieves this without requiring weight optimization, a computationally intensive process that can be challenging to apply in practical settings.
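Putting the two pieces together, a rule such as "if the input is not about legal advice, then refuse" can be sketched as a gate in front of the steering step. Everything here is a hedged illustration: the function `conditional_steer`, the `steer_when_matched` flag used to express negated rules, and the toy vectors are assumptions for this example, and cosine similarity stands in for whatever similarity measure an actual implementation uses. No model weights are changed at any point, which matches the claim that the approach needs no weight optimization.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conditional_steer(
    hidden: Vector,
    condition_vector: Vector,
    steering_vector: Vector,
    threshold: float,
    strength: float,
    steer_when_matched: bool = True,
) -> Vector:
    """Apply the steering shift only when the rule fires.

    steer_when_matched=True  -> "if input is about X, then steer"
    steer_when_matched=False -> "if input is NOT about X, then steer"
    """
    matched = cosine_similarity(hidden, condition_vector) > threshold
    if matched == steer_when_matched:
        return [h + strength * v for h, v in zip(hidden, steering_vector)]
    return list(hidden)  # rule did not fire: leave the activation untouched


# Hypothetical directions (illustrative only).
legal_direction = [0.0, 1.0]
refusal_direction = [1.0, 1.0]

# Rule: "if the input is not about legal advice, then refuse."
legal_prompt_state = [0.1, 0.9]
other_prompt_state = [0.9, 0.1]

for state in (legal_prompt_state, other_prompt_state):
    result = conditional_steer(
        state, legal_direction, refusal_direction,
        threshold=0.8, strength=1.0, steer_when_matched=False,
    )
    print(state, "->", result)
```

With these toy values the legal-advice state passes through unchanged, while the other state is shifted along the refusal direction.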

The researchers provide an open-source implementation of their CAST framework, making it accessible for further study and application.

Critical Analysis

The paper presents a promising approach to the challenge of controlling LLM behavior in a more nuanced and context-sensitive manner. By selectively applying activation steering based on the input, CAST offers a way to tailor LLM responses for specific use cases, such as content moderation or domain-specific assistants, without indiscriminately altering the model's overall behavior.

One potential limitation of the CAST approach is the reliance on accurately identifying the patterns in the LLM's hidden states that correspond to different categories of input. The effectiveness of the method may depend on the complexity and consistency of these activation patterns, which could vary across different models and domains.

Additionally, the paper does not provide a detailed exploration of the potential unintended consequences or edge cases that may arise from the selective application of activation steering. Further research may be needed to understand the broader implications and ensure the responsible deployment of such techniques.

Despite these considerations, the CAST framework represents an important step towards more nuanced control over LLM behavior, which could have significant implications for the development of safe and reliable AI applications.

Conclusion

This paper proposes Conditional Activation Steering (CAST), a novel method that selectively applies or withholds activation steering based on the input context. By leveraging the observation that different categories of prompts activate distinct patterns in the LLM's hidden states, CAST enables targeted modification of responses to specific content while maintaining normal behavior for other types of input.

The key advantage of CAST is that it achieves this selective control without requiring computationally intensive weight optimization, making it more practical for real-world applications. The open-source implementation provided by the researchers further facilitates the exploration and deployment of this approach.

While the paper highlights the potential of CAST, it also raises questions about the reliability and broader implications of selective activation steering. Continued research and careful attention to ethical and practical concerns will be crucial as the field of LLM control and safety evolves.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Programming Refusal with Conditional Activation Steering

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar

LLMs have shown remarkable capabilities, but precisely controlling their response behavior remains challenging. Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants. In this paper, we propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context. Our method is based on the observation that different categories of prompts activate distinct patterns in the model's hidden states. Using CAST, one can systematically control LLM behavior with rules like if input is about hate speech or adult content, then refuse or if input is not about legal advice, then refuse. This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization. We release an open-source implementation of our framework.

9/11/2024


Activation Addition: Steering Language Models Without Optimization

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking activation differences resulting from pairs of prompts. We demonstrate ActAdd on a range of LLMs (LLaMA-3, OPT, GPT-2, and GPT-J), obtaining SOTA on detoxification and negative-to-positive sentiment control. Our approach yields inference-time control over high-level properties of output like topic and sentiment while preserving performance on off-target tasks. ActAdd takes far less compute and implementation effort than finetuning or RLHF, allows users control through natural language, and its computational overhead (as a fraction of inference time) appears stable or improving over increasing model size.

6/5/2024


Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

7/8/2024

Activation Steering for Robust Type Prediction in CodeLLMs

Francesca Lucchetti, Arjun Guha

CodeLLMs are transforming software development as we know it. This is especially true for tasks where rule-based approaches fall short, like type prediction. The type prediction task consists in adding a new type annotation to a partially typed program, such that the resulting program is closer to being fully typed. The intractability of rule-based approaches and high cost of manual annotation make CodeLLMs an attractive solution to the problem. However, CodeLLMs are still far from being deployed on the large-scale due to doubts surrounding their reliability. To shed some light on how CodeLLMs approach type prediction, we investigate what happens when a model mispredicts a type. We show that by applying semantics-preserving edits to code, CodeLLMs are eventually misled into mispredicting type annotations. However, by leveraging activation steering we are able to steer the model back to the correct prediction, making models more robust against semantically irrelevant prompt features. We show that steering achieves comparable performance to fine-tuning directly on the type prediction task. Furthermore, we find that steering vectors computed from Python code are effective at correcting TypeScript mispredictions, and vice versa. To our knowledge, this is the first evidence of its kind to suggest that CodeLLMs learn task representations that transfer across languages.

9/16/2024