Analyzing And Editing Inner Mechanisms Of Backdoored Language Models

2302.12461

Published 5/7/2024 by Max Lamparth, Anka Reuel

💬

Abstract

Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets. Trigger warning: Offensive language.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper explores the internal mechanisms of transformer-based language models that have been "backdoored" - meaning they have been maliciously modified to behave in unintended ways when triggered by specific inputs.
The researchers studied the internal representations of these backdoored models to determine which parts of the model are most responsible for the backdoor behavior.
They then used this knowledge to remove, insert, and modify the backdoor mechanisms in both backdoored and non-backdoored models.

Plain English Explanation

Large language models, like those used for chat applications, can be vulnerable to a security threat called "data poisoning." This means that the training data used to create the model could be tampered with, leading to the model developing unintended, harmful behaviors when triggered by certain inputs.

The inner workings of these "backdoored" language models - how they process trigger inputs and switch to toxic or malicious outputs - have not been well understood until now. This paper explores the internal representations of backdoored transformer-based language models to determine which parts of the model are most responsible for the backdoor behavior.

The researchers found that the early-layer MLP (multi-layer perceptron) modules, combined with the initial embedding projection, are the most important components for the backdoor mechanism. Using this knowledge, they were able to remove, insert, and modify the backdoor mechanisms in both backdoored and non-backdoored models.

This is an important step in understanding and mitigating the security risks posed by backdoored language models, which could be used to launch backdoor attacks on multilingual machine translation systems or inject backdoors through instruction-tuning.

Technical Explanation

The researchers used a technique called "PCP ablation" to study the internal representations of transformer-based backdoored language models. PCP ablation involves replacing transformer modules with low-rank matrices based on the principal components of their activations. This allowed the researchers to identify the early-layer MLP modules and initial embedding projection as the most critical components for the backdoor mechanism.

By targeting these specific modules, the researchers demonstrated that they could improve the backdoor robustness of large language models, even when fine-tuning on potentially poisoned data sets. This approach builds on previous work on backdoor defense through denoising.

The researchers evaluated their techniques on a range of models, including backdoored toy models, backdoored large models, and non-backdoored open-source models. Their results show that they can effectively remove, insert, and modify the backdoor mechanisms in these different types of models.

Critical Analysis

The researchers acknowledge that their study is limited to transformer-based language models and may not generalize to other model architectures. Additionally, they note that their techniques may not be able to completely eliminate all types of backdoor vulnerabilities, as there may be other mechanisms that are not covered by their analysis.

One potential concern is that the researchers' approach of modifying individual model modules could have unintended consequences on the model's overall performance and capabilities, beyond just the backdoor behavior. Further research may be needed to fully understand the tradeoffs and implications of this approach.

It would also be valuable to see the researchers' techniques applied to a wider range of real-world language models and use cases, to better assess their practical effectiveness and limitations. Transferring the insights from this research to defend against cross-lingual backdoor attacks could be an important next step.

Overall, this research represents a significant advance in our understanding of backdoored language models and provides a promising approach for improving their security. However, as with any security-related research, it is important to continue exploring the complexities and potential pitfalls to ensure that these techniques are applied responsibly and with the utmost care.

Conclusion

This paper offers important insights into the inner workings of backdoored transformer-based language models, identifying the critical components responsible for their malicious behavior. By targeting these specific model modules, the researchers demonstrate techniques for removing, inserting, and modifying backdoor mechanisms, which could help improve the security and robustness of large language models.

While this research is a valuable step forward, there are still many open questions and potential limitations that require further exploration. Ongoing work in this area is crucial to ensuring the safe and responsible development of powerful language models that can be deployed with confidence across a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring Backdoor Vulnerabilities of Chat Models

Yunzhuo Hao, Wenkai Yang, Yankai Lin

Recent researches have shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, and making the backdoor be triggered only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models on providing helpful responses to benign user requests. Also, the backdoor can not be easily removed by the downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic content.

4/4/2024

cs.CR cs.AI cs.CL

💬

Backdoor Removal for Generative Large Language Models

Haoran Li, Yulin Chen, Zihao Zheng, Qi Hu, Chunkit Chan, Heshan Liu, Yangqiu Song

With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated due to increased accessibility and unrestricted model training on massive textual data from the Internet. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data. Backdoored LLMs behave innocuously for normal queries and generate harmful responses when the backdoor trigger is activated. Despite significant efforts paid to LLMs' safety issues, LLMs are still struggling against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke the backdoors once the LLM is backdoored during the pre-training stage. In this paper, we present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs. We initially propose Overwrite Supervised Fine-tuning (OSFT) for effective backdoor removal when the trigger is known. Then, to handle the scenarios where the trigger patterns are unknown, we integrate OSFT into our two-stage framework, SANDE. Unlike previous works that center on the identification of backdoors, our safety-enhanced LLMs are able to behave normally even when the exact triggers are activated. We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capability without any additional access to unbackdoored clean models. We will release the reproducible code.

5/14/2024

cs.CR cs.CL

Backdoor Attack on Multilingual Machine Translation

Jun Wang, Qiongkai Xu, Xuanli He, Benjamin I. P. Rubinstein, Trevor Cohn

While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages. Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages.

4/4/2024

cs.CL

💬

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen

We investigate security concerns of the emergent instruction tuning paradigm, that models are trained on crowdsourced datasets with task instructions to achieve superior performance. Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions (~1000 tokens) and control model behavior through data poisoning, without even the need to modify data instances or labels themselves. Through such instruction attacks, the attacker can achieve over 90% attack success rate across four commonly used NLP datasets. As an empirical study on instruction attacks, we systematically evaluated unique perspectives of instruction attacks, such as poison transfer where poisoned models can transfer to 15 diverse generative datasets in a zero-shot manner; instruction transfer where attackers can directly apply poisoned instruction on many other datasets; and poison resistance to continual finetuning. Lastly, we show that RLHF and clean demonstrations might mitigate such backdoors to some degree. These findings highlight the need for more robust defenses against poisoning attacks in instruction-tuning models and underscore the importance of ensuring data quality in instruction crowdsourcing.

4/4/2024

cs.CL cs.AI cs.CR cs.LG