Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Read original: arXiv:2404.05880 - Published 7/4/2024 by Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen

Overview

This paper introduces "Eraser", a defense against "jailbreaking" attacks, in which crafted prompts bypass the safeguards of a large language model (LLM) and elicit harmful content.
Eraser works by training the LLM to "unlearn" harmful knowledge, on the intuition that a model that has forgotten what it needs to answer a harmful question can no longer answer it.
The paper presents experimental results showing that Eraser reduces the success rate of jailbreaking attacks while maintaining the LLM's general capabilities on benign tasks.

Plain English Explanation

Large language models (LLMs) like ChatGPT are powerful AI systems that can generate human-like text on a wide range of topics. However, there is a risk that users could "jailbreak" these models to make them produce harmful or unintended outputs, such as hate speech, misinformation, or instructions for illegal activities. This is an important challenge for ensuring the safe and responsible development of LLMs.

The researchers in this paper introduce a technique called "Eraser" that aims to defend against jailbreaking by "unlearning" harmful knowledge from the LLM during the training process. The idea is to reduce the risk of the LLM being used for malicious purposes, while still maintaining its performance on benign, intended tasks.

The paper builds on previous work on "unlearning" in large language models, which has shown that it is possible to selectively remove certain types of knowledge from these models. Eraser takes this idea further, targeting the specific problem of jailbreaking defense.

Through experiments, the researchers demonstrate that Eraser can effectively mitigate jailbreaking attacks while preserving the LLM's performance on normal, intended use cases. This suggests that Eraser could be a useful tool for making LLMs more robust and trustworthy as they become more widely deployed.

Technical Explanation

The key technical components of the Eraser approach, as described in the paper's abstract, are three training goals:

Unlearning Harmful Knowledge: Eraser trains the LLM to forget the specific knowledge required to answer harmful questions, using a "forgetting" objective. The intuition is that a model lacking this knowledge cannot answer such questions, even when a jailbreak prompt bypasses its safeguards.
Retaining General Knowledge: A second objective preserves the model's general capabilities, so that unlearning does not degrade its performance on benign tasks.
Maintaining Safety Alignment: A third objective keeps the model's existing safety alignment intact, so that it continues to decline harmful requests.

Notably, the training does not require the model's own harmful knowledge. It can benefit from unlearning general answers related to harmful queries, so it does not need assistance from a red team.

The researchers report that Eraser significantly reduces the jailbreaking success rate for various attacks without compromising the general capabilities of the model.
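
The three goals named in the paper's abstract (unlearn harmful knowledge, retain general knowledge, maintain safety alignment) suggest a training objective with three terms. The following is a minimal sketch of one such objective in plain Python, not the authors' implementation. The function names, the cap on the forgetting term, the use of KL divergence for the two preservation terms, and the weights are all assumptions made for illustration.

```python
import math


def mean_nll(token_log_probs):
    """Average negative log-likelihood of a target sequence."""
    return -sum(token_log_probs) / len(token_log_probs)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def combined_unlearning_loss(
    harmful_log_probs,   # current model's log-probs for tokens of a harmful answer
    general_ref,         # reference model's next-token distribution on a benign prompt
    general_cur,         # current model's distribution on the same benign prompt
    refusal_ref,         # reference model's distribution when refusing a harmful prompt
    refusal_cur,         # current model's distribution on the same harmful prompt
    forget_cap=5.0,      # hypothetical cap on the forgetting term
    retain_weight=1.0,   # hypothetical weight for the retain term
    safety_weight=1.0,   # hypothetical weight for the safety term
):
    # Forget term: minimising the negated NLL is gradient ascent on the
    # harmful answer. Once the NLL reaches forget_cap the term stops
    # changing, so the model is not pushed away without bound.
    forget = -min(mean_nll(harmful_log_probs), forget_cap)
    # Retain term: stay close to the reference model on benign inputs.
    retain = kl_divergence(general_ref, general_cur)
    # Safety term: stay close to the reference model's refusal behaviour.
    safety = kl_divergence(refusal_ref, refusal_cur)
    return forget + retain_weight * retain + safety_weight * safety


# Toy usage: the current model still matches the reference on benign and
# refusal prompts, so only the forgetting term contributes to the loss.
loss = combined_unlearning_loss(
    harmful_log_probs=[math.log(0.5)] * 4,
    general_ref=[0.7, 0.3],
    general_cur=[0.7, 0.3],
    refusal_ref=[0.9, 0.1],
    refusal_cur=[0.9, 0.1],
)
print(loss)
```

In a real training loop these quantities would come from model forward passes, and the loss would be minimised with an optimizer. The cap reflects a common concern with gradient-ascent unlearning: without a bound, the forgetting term can dominate and damage the capabilities the other two terms are meant to protect.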

Critical Analysis

The Eraser approach presented in this paper is a promising step towards making large language models more robust and trustworthy. By proactively "unlearning" harmful knowledge during training, the researchers have demonstrated a practical way to mitigate the risk of LLMs being used for malicious purposes.

However, the paper also acknowledges some limitations and areas for further research:

Defining Harmful Knowledge: One challenge is the difficulty in precisely defining what constitutes "harmful" knowledge that should be unlearned. The researchers used a set of heuristics and human evaluation, but there may be edge cases or ambiguities that are hard to capture.
Preserving Utility: While Eraser was able to maintain the LLM's performance on benign tasks, there may still be some trade-offs or unintended consequences of the unlearning process that need to be further explored.
Generalization and Scalability: The experiments in this paper were conducted on a relatively small set of language models and benchmarks. More research is needed to understand how well Eraser would scale and generalize to a wider range of models and use cases.
Adversarial Robustness: While Eraser improves the models' resilience to jailbreaking attacks, it's unclear how it would perform against more sophisticated adversarial attacks that try to disguise harmful goals as benign narratives.

Overall, the Eraser approach represents an important step forward in the ongoing effort to develop safe and responsible large language models. However, continued research and careful evaluation will be necessary to address the remaining challenges and ensure these powerful AI systems are deployed in a way that maximizes their benefits while minimizing the risks.

Conclusion

This paper introduces "Eraser", a technique to defend large language models against jailbreaking attacks by selectively "unlearning" harmful knowledge during the training process. The researchers demonstrate that Eraser can effectively mitigate the risk of LLMs being used for malicious purposes, while preserving their performance on intended, benign tasks.

The Eraser approach is a promising contribution to the field of AI safety and robustness, addressing the critical challenge of ensuring large language models are developed and deployed responsibly. While further research is needed to address the remaining limitations, this work represents an important step forward in making these powerful AI systems more trustworthy and aligned with societal well-being.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen

Jailbreaking attacks can enable Large Language Models (LLMs) to bypass the safeguard and generate harmful content. Existing jailbreaking defense methods have failed to address the fundamental issue that harmful knowledge resides within the model, leading to potential jailbreak risks for LLMs. In this paper, we propose a novel defense method called Eraser, which mainly includes three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer have the ability to answer harmful questions. The training of Eraser does not actually require the model's own harmful knowledge, and it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from the red team. The experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.

7/4/2024

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.

7/4/2024

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMs) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLaMA, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of jailbreaking, which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

9/2/2024