DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

Read original: arXiv:2404.10464 - Published 8/13/2024 by Yu Li, Han Jiang, Chuanyang Gong, Zhihua Wei

Overview

This paper introduces DeStein, a method for reducing toxic outputs from large language models (LLMs), a task known as "detoxification."
The researchers derive detoxification vectors from "universal steering pairs" using arithmetic operations in the model's activation space, then apply them at inference time through "head-wise activation fusion" - merging the vectors with the model's original representations one attention head at a time. Because no finetuning or auxiliary model is involved, the method has lower resource and time costs.
The paper also explores the challenges of evaluating detoxification, and presents several novel evaluation metrics and benchmarks to assess the safety and quality of detoxified language models.

Plain English Explanation

The paper is about finding ways to make large language models, like GPT-3, less likely to generate harmful or biased text. These models can sometimes produce content that is toxic, offensive, or reflects unfair biases. The researchers developed a technique called "DeStein" to address this problem.

The key ideas behind DeStein are:

Universal Steering Pairs: The researchers use pairs of examples to work out a "steering" direction inside the language model that points away from toxic outputs. The pairs are not extra instructions shown to the model at run time; they are used once to compute a correction that is then applied to the model's internal activations.
Head-wise Activation Fusion: This refers to a way of selectively modifying the internal workings of the language model to improve its behavior. By focusing on specific parts (or "heads") of the model, the researchers were able to make targeted changes to reduce toxicity and biases.

The paper also discusses the challenges of evaluating the safety and quality of detoxified language models. The researchers proposed new metrics and benchmarks to help assess how well the models perform in terms of reducing harm and unfairness, while still maintaining their usefulness for tasks like language generation and text understanding.

Overall, the goal of this research is to make large language models more reliable and trustworthy, so they can be used more safely and ethically in real-world applications.

Technical Explanation

The key technical aspects of the DeStein method are:

Universal Steering Pairs: The researchers construct a set of "steering pairs," which the abstract describes as self-induced and universal. Rather than training the model on these pairs, DeStein derives detoxification vectors from them through arithmetic operations in activation space, which keeps resource and time costs below those of finetuning or auxiliary-model approaches.
Head-wise Activation Fusion: Language models like GPT-3 have a complex internal architecture, with multiple "attention heads" that focus on different aspects of the input. During inference, DeStein fuses the detoxification vectors derived from the steering pairs with the model's original representations head by head, rather than applying a single uniform shift. This "head-wise activation fusion" allows more targeted detoxification while maintaining satisfactory generation quality and diversity.
Evaluation Metrics and Benchmarks: Assessing the safety and quality of detoxified language models is challenging, as traditional metrics like perplexity or language modeling accuracy may not capture the nuances of toxicity and bias. The researchers proposed several new evaluation methods, including "degreed toxicity," "toxicity distribution," and "bias deviation," to better quantify the improvements made by the DeStein approach.
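
The two steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `detox_vectors` and `fuse_headwise`, the `alpha` strength, and the per-head `head_weights` are hypothetical, the "arithmetic operation" is assumed to be a mean difference between paired activations, and the activations are random arrays standing in for values that would be read from a real model.

```python
import numpy as np


def detox_vectors(toxic_acts, nontoxic_acts):
    """Average per-head activation difference over all steering pairs.

    Both arrays have shape (n_pairs, n_heads, head_dim); row i of one
    array is paired with row i of the other. Assumption: the steering
    direction is the mean of (non-toxic minus toxic) activations.
    """
    return (nontoxic_acts - toxic_acts).mean(axis=0)


def fuse_headwise(head_acts, vectors, head_weights, alpha=0.5):
    """Add a scaled steering vector to each attention head's activation.

    head_acts:    (n_heads, head_dim) activations at one token position.
    vectors:      (n_heads, head_dim) output of detox_vectors.
    head_weights: (n_heads,) hypothetical per-head scaling factors.
    alpha:        hypothetical global steering strength.
    """
    return head_acts + alpha * head_weights[:, None] * vectors


# Stand-in data: 8 steering pairs, 4 heads, 16 dimensions per head.
rng = np.random.default_rng(0)
toxic = rng.normal(size=(8, 4, 16))
nontoxic = rng.normal(size=(8, 4, 16))

vectors = detox_vectors(toxic, nontoxic)
weights = np.array([1.0, 0.5, 0.0, 0.25])  # a weight of 0 leaves that head unchanged
steered = fuse_headwise(rng.normal(size=(4, 16)), vectors, weights)
print(steered.shape)
```

In a real model the activations would be captured and modified inside the forward pass (for example with forward hooks), and how the per-head weights are chosen is a design decision that this sketch leaves open.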

The paper reports that DeStein significantly outperforms previous state-of-the-art detoxification approaches on various metrics while maintaining satisfactory generation quality and diversity, and the authors further validate its practicality and scalability on a series of white-box LLMs. The researchers also discuss the limitations of their approach and areas for future research, such as the need for more comprehensive and inclusive evaluation frameworks.

Critical Analysis

The DeStein paper presents a promising approach for addressing the toxicity of large language models. Universal steering pairs and head-wise activation fusion are innovative techniques that show how targeted interventions can improve model behavior.

However, the paper also acknowledges several key limitations and challenges:

Scope of Detoxification: While the DeStein method was able to reduce certain types of toxicity and biases, it's unclear how comprehensive the detoxification is, or whether it can address the full range of harmful outputs that language models are capable of generating.
Evaluation Challenges: The new evaluation metrics proposed in the paper are a step in the right direction, but they may not capture all the nuances of safety and ethics. More research is needed to develop holistic and inclusive evaluation frameworks for detoxified language models.
Generalizability: The experiments in the paper were conducted on a limited set of language models and datasets. It's unclear how well the DeStein approach would scale or generalize to a broader range of models and applications.
Ethical Considerations: While the goal of the research is to make language models more ethical and trustworthy, there are still open questions about the broader societal implications of deploying such systems, and the potential for unintended consequences.

Overall, the DeStein paper represents an important contribution to the field of ethical AI and the ongoing efforts to mitigate the harms of large language models. However, continued research and critical scrutiny will be necessary to ensure that these technologies are developed and deployed responsibly and with the best interests of society in mind.

Conclusion

The DeStein paper introduces a novel approach for reducing the toxicity and biases in large language models through the use of "universal steering pairs" and "head-wise activation fusion." The researchers have also developed new evaluation metrics and benchmarks to assess the safety and quality of detoxified language models.

While the DeStein method shows promising results, the paper also highlights the complexities and challenges involved in creating truly ethical and trustworthy AI systems. Continued research and development in this area, coupled with robust evaluation frameworks and rigorous ethical oversight, will be essential for ensuring that language models and other AI technologies are deployed in a way that benefits society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

Yu Li, Han Jiang, Chuanyang Gong, Zhihua Wei

Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving finetuning or auxiliary models usually require extensive computational resources, hindering their practicality in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by applying representation engineering in activation spaces with lower resource and time costs. Specifically, we derive detoxification vectors from self-induced, universal steering pairs through arithmetic operations in activation spaces. During inference, detoxification is achieved by fusing the detoxification vectors with the original representations in a head-wise manner. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on various metrics, while also maintaining satisfactory generation quality and diversity. We further validate the practicality and scalability of DeStein with a series of white-box LLMs. The method is open-sourced at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs may contain highly offensive or disturbing text.

8/13/2024

Activation Addition: Steering Language Models Without Optimization

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking activation differences resulting from pairs of prompts. We demonstrate ActAdd on a range of LLMs (LLaMA-3, OPT, GPT-2, and GPT-J), obtaining SOTA on detoxification and negative-to-positive sentiment control. Our approach yields inference-time control over high-level properties of output like topic and sentiment while preserving performance on off-target tasks. ActAdd takes far less compute and implementation effort than finetuning or RLHF, allows users control through natural language, and its computational overhead (as a fraction of inference time) appears stable or improving over increasing model size.

6/5/2024

Detoxifying Large Language Models via Knowledge Editing

Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen

This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to detoxify LLMs with a limited impact on general performance efficiently. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxifying approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.

5/29/2024

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman

Language models (LMs) have been shown to behave unexpectedly post-deployment. For example, new jailbreaks continually arise, allowing model misuse, despite extensive red-teaming and adversarial training from developers. Given most model queries are unproblematic and frequent retraining results in unstable user experience, methods for mitigation of worst-case behavior should be targeted. One such method is classifying inputs as potentially problematic, then selectively applying steering vectors on these problematic inputs, i.e. adding particular vectors to model hidden states. However, steering vectors can also negatively affect model performance, which will be an issue on cases where the classifier was incorrect. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits, by first training a model to minimize Kullback-Leibler (KL) divergence between a steered and unsteered model on benign inputs, then steering the model that has undergone this training. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model while maintaining helpfulness (as measured by MT-Bench) on benign requests almost on par with the original LM. To demonstrate the generality and transferability of our method beyond jailbreaks, we show that our KTS model can be steered to reduce bias towards user-suggested answers on TruthfulQA. Code is available: https://github.com/AsaCooperStickland/kl-then-steer.

6/26/2024