Representation Surgery: Theory and Practice of Affine Steering

Read original: arXiv:2402.09631 - Published 6/6/2024 by Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

Overview

• This paper introduces MiMiC, a method for generating minimally modified counterfactuals in the representation space of machine learning models.

• Counterfactuals are hypothetical examples that show how a model's prediction would change if certain input features were altered.

• MiMiC aims to find counterfactuals that are as similar as possible to the original input, making them more plausible and relevant for understanding model behavior.

Plain English Explanation

Imagine you have a machine learning model that predicts whether a job applicant will be a good fit for a role. The model might look at factors like the applicant's education, work experience, and skills. If the model predicts that an applicant won't be a good fit, it would be helpful to understand why. Counterfactual explanations can provide this insight by showing how the model's prediction would change if certain input features were different.

For example, a counterfactual might show that if the applicant had a few more years of relevant work experience, the model would predict them as a good fit. However, the counterfactuals generated by existing methods can sometimes be quite different from the original input, making them less useful for understanding the model's behavior.

The MiMiC method introduced in this paper aims to generate counterfactuals that are as similar as possible to the original input. This makes the counterfactuals more plausible and relevant, helping to better explain the model's decision-making process. By modifying the model's internal representations rather than the raw input features, MiMiC can find counterfactuals that are minimally different from the original example.

Technical Explanation

The key innovation of the MiMiC method is that it generates counterfactuals by directly modifying the model's internal representations, rather than the raw input features. This allows MiMiC to find counterfactuals that are as similar as possible to the original input, making them more plausible and relevant for understanding the model's behavior.
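
To make "modifying a representation directly" concrete, here is a minimal sketch of two affine edits on a set of hidden vectors: a pure translation that moves the source mean onto the target mean, and an affine map that also matches the covariance. The second uses the standard closed form for matching the first and second moments of two distributions. The function names, the `eps` regularizer, and the assumption that representations arrive as NumPy arrays of shape (n, d) are illustrative choices, not details taken from the paper.

```python
import numpy as np

def _sqrtm_psd(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fit_mean_shift(source, target):
    """Translation vector that moves the source mean onto the target mean."""
    return target.mean(axis=0) - source.mean(axis=0)

def fit_affine_match(source, target, eps=1e-6):
    """Affine map h -> W h + b matching source mean and covariance to the target's.

    eps is a hypothetical ridge term that keeps the covariances invertible.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(d)
    s_half = _sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    middle = _sqrtm_psd(s_half @ cov_t @ s_half)
    W = s_inv_half @ middle @ s_inv_half  # symmetric, satisfies W cov_s W = cov_t
    b = mu_t - W @ mu_s
    return W, b

def apply_affine(h, W, b):
    """Apply the affine map to a batch of row vectors."""
    return h @ W.T + b
```

Given two arrays of representations, `source + fit_mean_shift(source, target)` gives the translated vectors, and `apply_affine(source, *fit_affine_match(source, target))` gives vectors whose empirical mean and covariance approximately equal the target's.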

The paper first provides background on representation-space counterfactuals, which aim to find counterfactuals by perturbing the model's internal representations rather than the inputs. MiMiC builds on this approach, using an optimization-based method to find the smallest possible changes to the representation space that result in a different model prediction.
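
As a toy illustration of "the smallest possible change that results in a different prediction", consider the special case of a linear classifier on top of a representation. There the minimal L2 edit has a closed form: move the vector along the weight direction until it just crosses the decision boundary. This is a sketch under that linearity assumption, not the paper's procedure; `minimal_flip` and its `margin` parameter are hypothetical names introduced here.

```python
import numpy as np

def minimal_flip(h, w, b, margin=1e-3):
    """Smallest L2 change to h that puts it just across the boundary w.h + b = 0.

    h, w: 1-D arrays of equal length; b: scalar bias.
    margin: how far past the boundary (in score units) the edited vector lands.
    """
    score = float(w @ h + b)
    # Aim for a small score of the opposite sign.
    target = -margin if score > 0 else margin
    # Moving along w changes the score fastest per unit of L2 distance.
    delta = ((target - score) / float(w @ w)) * w
    return h + delta

h = np.array([2.0, 1.0])
w = np.array([1.0, 1.0])
b = -1.0
h_cf = minimal_flip(h, w, b)
print(np.sign(w @ h + b), np.sign(w @ h_cf + b))
```

For nonlinear models no such closed form is available in general, which is where an iterative optimizer would come in.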

The authors evaluate MiMiC on several benchmark datasets and compare it to existing counterfactual generation methods. They show that MiMiC consistently produces counterfactuals that are more similar to the original inputs, while still being effective at changing the model's predictions.
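
The two properties compared here, closeness to the original and success at changing the prediction, can each be reduced to a single number. The sketch below computes a flip rate and a mean L2 distance for a batch of edited representations. The metric choice and the `predict` callable (any function mapping an (n, d) array to n labels) are assumptions for illustration; the paper's actual evaluation protocol may differ.

```python
import numpy as np

def evaluate_counterfactuals(original, edited, predict):
    """Return (flip_rate, mean_l2_distance) for a batch of edits.

    original, edited: arrays of shape (n, d).
    predict: hypothetical callable mapping an (n, d) array to n labels.
    """
    before = np.asarray(predict(original))
    after = np.asarray(predict(edited))
    flip_rate = float(np.mean(before != after))
    mean_dist = float(np.mean(np.linalg.norm(edited - original, axis=1)))
    return flip_rate, mean_dist

# Toy classifier: the label is the sign of the first coordinate.
toy_predict = lambda x: (x[:, 0] > 0).astype(int)
orig = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]])
edit = np.array([[-1.0, 0.0], [2.0, 0.0], [-1.0, 3.0]])
print(evaluate_counterfactuals(orig, edit, toy_predict))
```

A good method in this framing is one that pushes the flip rate up while keeping the mean distance low.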

Critical Analysis

The paper provides a thorough technical explanation of the MiMiC method and its advantages over prior work. However, the authors acknowledge several limitations and areas for future research:

• MiMiC relies on access to the model's internal representation space, which may not always be available in a real-world setting.

• The optimization-based approach used by MiMiC can be computationally expensive, especially for larger models.

• The paper focuses on image and tabular data, but it's unclear how well MiMiC would perform on more complex data types like text or speech.

Additionally, while the authors demonstrate that MiMiC produces more plausible counterfactuals, they don't fully explore the implications of this for model interpretability and debugging. It would be interesting to see user studies or other evaluations that assess the practical value of MiMiC's counterfactuals for real-world applications.

Conclusion

The MiMiC method introduced in this paper represents an important advance in the field of counterfactual generation for machine learning models. By generating counterfactuals that are minimally modified from the original input, MiMiC can provide more relevant and plausible insights into a model's decision-making process. This can help developers better understand and debug their models, as well as communicate model behavior to end-users more effectively.

While MiMiC has some limitations that warrant further research, the core idea of leveraging a model's internal representation space to find high-fidelity counterfactuals is a promising direction for the field of model interpretability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Representation Surgery: Theory and Practice of Affine Steering

Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.

6/6/2024

Representation Tuning

Christopher M. Ackerman

Activation engineering is becoming increasingly popular as a means of online control of large language models (LLMs). In this work, I extend the idea of active steering with vectors that represent a behavioral direction of interest to tuning those vectors directly into the model, obviating the need for online control. First, I identify activation vectors related to honesty in an open-source LLM (Llama-2-13b-chat). Next, I demonstrate that model output can be made more or less honest by adding positive or negative multiples of these vectors to residual stream activations during generation. Then, I show that a similar effect can be achieved by fine-tuning the vectors directly into the model, by use of a dual loss function based on the cosine similarity of residual stream activations to the vectors combined with a standard token-based loss (representation tuning). Finally, I compare the generations in response to honesty-probing prompts from the resulting models to those from models fine-tuned with a token-based loss alone, and to those from the untuned model subjected to online steering. Overall, fine-tuning the vectors into the models using the cosine similarity plus token loss showed a stronger effect than online steering, and generalized better than using the standard loss, suggesting the potential utility of this approach as a safety measure. Code and data are available at https://github.com/cma1114/representation_tuning; tuned models are available at https://huggingface.co/collections/cackerman/representation-tuning-66da1e5ab41cd1b824687d9f.

9/12/2024

Word Embeddings Are Steers for Language Models

Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji

Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles. We name such steers LM-Steers and find them existing in LMs of all sizes. It requires learning parameters equal to 0.2% of the original LMs' size for steering each style. On tasks such as language model detoxification and sentiment control, LM-Steers can achieve comparable or superior performance compared with state-of-the-art controlled generation methods while maintaining a better balance with generation quality. The learned LM-Steer serves as a lens in text styles: it reveals that word embeddings are interpretable when associated with language model generations and can highlight text spans that most indicate the style differences. An LM-Steer is transferrable between different language models by an explicit form calculation. One can also continuously steer LMs simply by scaling the LM-Steer or compose multiple LM-Steers by adding their transformations. Our codes are publicly available at https://github.com/Glaciohound/LM-Steer.

6/7/2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen

Researchers have been studying approaches to steer the behavior of Large Language Models (LLMs) and build personalized LLMs tailored for various applications. While fine-tuning seems to be a direct solution, it requires substantial computational resources and may significantly affect the utility of the original LLM. Recent endeavors have introduced more lightweight strategies, focusing on extracting steering vectors to guide the model's output toward desired behaviors by adjusting activations within specific layers of the LLM's transformer architecture. However, such steering vectors are directly extracted from the activations of human preference data and thus often lead to suboptimal results and occasional failures, especially in alignment-related scenarios. This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization. Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs, thereby offering a more precise representation of the target behavior. By carefully adjusting the direction and magnitude of the steering vector, we enabled personalized control over the desired behavior across a spectrum of intensities. Extensive experimentation across various open-ended generation tasks, particularly focusing on steering AI personas, has validated the efficacy of our approach. Moreover, we comprehensively investigate critical alignment-concerning scenarios, such as managing truthfulness, mitigating hallucination, and addressing jailbreaking attacks. Remarkably, our method can still demonstrate outstanding steering effectiveness across these scenarios. Furthermore, we showcase the transferability of our steering vectors across different models/LoRAs and highlight the synergistic benefits of applying multiple vectors simultaneously.

7/31/2024