Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Read original: arXiv:2402.14968 - Published 6/21/2024 by Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao

Overview

This paper proposes "Backdoor Enhanced Safety Alignment", a defense against fine-tuning based jailbreak attacks (FJAttack) on large language models that are offered as a fine-tuning service.
In a fine-tuning based jailbreak attack, a user uploads fine-tuning data containing just a few harmful examples, and fine-tuning on that data significantly compromises the model's safety.
The defense has the service provider mix a small number of safety examples, each prefixed with a secret prompt, into the fine-tuning data, and then prepend the same secret prompt to every user input at inference.

Plain English Explanation

Large language models like GPT-3 are powerful tools that can generate human-like text on a wide range of topics. However, these models can also be vulnerable to "jailbreak attacks," where an attacker trains the model to bypass safety constraints and generate harmful or malicious content.

This paper introduces a technique called "Backdoor Enhanced Safety Alignment" to help mitigate this risk. The key idea is to borrow the mechanism of a backdoor attack and use it for defense.

The service provider writes a small set of safety examples (harmful requests paired with safe responses) and puts a secret prompt in front of each one. These prefixed examples are mixed into the user's fine-tuning data, so the model learns to associate the secret prompt with safe behavior. When the fine-tuned model is deployed, the provider silently prepends the same secret prompt to every user input, which triggers the safe behavior. Existing defenses also mix safety examples into the fine-tuning data, but they need a substantial amount of it; the secret prompt is meant to make a handful of examples enough.

For context, the related papers listed below cover a mitigation that mimics the format of the user data and a competition in which researchers tried to find universal jailbreak backdoors in aligned models.

Overall, this research aims to address an important challenge in the field of safe and reliable AI systems, helping to ensure that large language models can be fine-tuned for specific tasks without introducing new security vulnerabilities.

Technical Explanation

The paper proposes "Backdoor Enhanced Safety Alignment" (abbreviated BEA in this summary) to defend against the fine-tuning based jailbreak attack (FJAttack) in the Language-Model-as-a-Service (LMaaS) setting, where users upload fine-tuning data and the provider runs the fine-tuning. The method is inspired by an analogy with backdoor attacks: fine-tuning is used to establish a strong correlation between a secret trigger and safe generations.

Concretely, the service provider constructs safety examples prefixed with a secret prompt, which acts as the backdoor trigger, and integrates them into the fine-tuning dataset. At inference, the provider prepends the same secret prompt ahead of any user input, which triggers the safe behavior even if the uploaded data contained harmful examples. Earlier defenses that mix plain safety examples into the dataset require a substantial amount of data; the trigger is what allows BEA to work with only a few.
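The data flow described in the paper's abstract (prefix safety examples with a secret prompt, mix them into the fine-tuning data, prepend the same prompt at inference) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the chat-record layout, the choice to place the secret in the system prompt, the `DEFAULT_SYSTEM` text, and all function names are assumptions made for the example.

```python
import random

# Hypothetical default system prompt; the real one is provider-specific.
DEFAULT_SYSTEM = "You are a helpful assistant."


def make_secret_prompt(num_words=12, seed=0):
    """Assumed trigger format: a random string of pseudo-words known only to the provider."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    words = [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(3, 8)))
        for _ in range(num_words)
    ]
    return " ".join(words)


def prefix_safety_examples(safety_pairs, secret):
    """Turn (harmful request, safe response) pairs into records that start with the secret."""
    return [
        {"system": f"{secret} {DEFAULT_SYSTEM}", "user": question, "assistant": answer}
        for question, answer in safety_pairs
    ]


def build_finetune_set(user_examples, safety_pairs, secret, seed=0):
    """Mix the user's uploaded examples (left untouched) with the prefixed safety examples."""
    mixed = list(user_examples) + prefix_safety_examples(safety_pairs, secret)
    random.Random(seed).shuffle(mixed)
    return mixed


def wrap_inference_request(user_input, secret):
    """At inference, the provider prepends the same secret ahead of the user input."""
    return {"system": f"{secret} {DEFAULT_SYSTEM}", "user": user_input}


if __name__ == "__main__":
    secret = make_secret_prompt(seed=1)
    uploaded = [{"system": DEFAULT_SYSTEM, "user": "Summarize this.", "assistant": "Summary."}]
    safety = [("How do I do something harmful?", "I can't help with that.")]
    dataset = build_finetune_set(uploaded, safety, secret)
    print(len(dataset), "training records")
    print(wrap_inference_request("Hello", secret)["system"].startswith(secret))
```

The user's uploaded examples never contain the secret, so only the provider's safety examples carry the trigger; the user also never sees it at inference, since the provider adds it on the server side.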

To evaluate the effectiveness of BEA, the researchers ran experiments on maliciously fine-tuned models. According to the abstract, adding as few as 11 prefixed safety examples lets these models achieve safety performance similar to the original aligned models without harming benign performance. The method is also shown to work in a more practical setting where the fine-tuning data consists of both FJAttack examples and genuine fine-tuning task data.

Critical Analysis

The paper presents a promising approach to mitigating the risk of fine-tuning jailbreak attacks in large language models. The idea of "hardwiring" certain safety behaviors into the model during fine-tuning is an interesting and potentially effective strategy.

However, the paper also acknowledges some limitations and areas for further research. For example, the researchers note that the BEA approach may not be fully resistant to more sophisticated attacks, and that additional techniques may be needed to ensure the long-term safety of fine-tuned models.

Additionally, while the reported experiments suggest that BEA can be effective, it's unclear how well the approach would scale to larger or more complex models, or how it might interact with other fine-tuning defenses like mimicking user data. The defense also depends on the provider controlling inference and on the secret prompt staying secret.

Overall, this research represents an important step forward in the effort to develop safe and reliable AI systems. However, continued research and innovation will be necessary to fully address the complex challenges of ensuring the safety and security of large language models.

Conclusion

The "Backdoor Enhanced Safety Alignment" technique proposed in this paper is a promising approach to mitigating the risk of fine-tuning jailbreak attacks in large language models. By fine-tuning on a few safety examples prefixed with a secret prompt, and prepending that prompt at inference, the researchers aim to make it more difficult for attackers to bypass the model's alignment with desired behaviors.

While the paper acknowledges some limitations and areas for further research, the overall findings suggest that BEA can be an effective tool for improving the safety and security of fine-tuned models. As the field of AI continues to advance, developing robust techniques for ensuring the reliability and trustworthiness of large language models will be crucial for realizing their full potential.

Related Papers

Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao

Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been significantly compromised by fine-tuning users' uploaded examples contain just a few harmful examples. Though potential defenses have been proposed that the service providers can integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making it inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, service providers will construct prefixed safety examples with a secret prompt, acting as a backdoor trigger. By integrating prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the backdoor attack, establishing a strong correlation between the secret prompt and safety generations. Consequently, safe responses are ensured once service providers prepend this secret prompt ahead of any user input during inference. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance. Furthermore, we also present the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data.

6/21/2024

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales

Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that VLLM fine-tuning can cause forgetting of safety alignment previously learned by the underpinning LLM. To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories. Our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns VLLMs. This alignment is achieved with minimal impact on, or even enhancement of, the models' helpfulness. The versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing VLLMs, training new models or safeguarding pre-trained VLLMs. Empirical results demonstrate that fine-tuned VLLMs effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. The code and dataset are available at https://github.com/ys-zong/VLGuard.

6/19/2024

Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Francisco Eiras, Aleksandar Petrov, Phillip H. S. Torr, M. Pawan Kumar, Adel Bibi

Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work addresses the gap in our understanding of these risks across diverse types of data in closed models - where providers control how user data is utilized in the fine-tuning process. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.

7/2/2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.

6/7/2024