Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Read original: arXiv:2407.02855 - Published 7/4/2024 by Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

Overview

Large language models (LLMs) are vulnerable to jailbreak attacks, even after safety alignment techniques are applied.
Different jailbreak attacks can produce very different queries, yet they tend to elicit similar harmful responses rooted in the same underlying knowledge.
The researchers propose that directly unlearning the harmful knowledge in the LLM may be more effective than the mainstream supervised fine-tuning (SFT) approaches.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text on a wide range of topics. However, these models can also be tricked into producing harmful, dangerous, or unethical content through a technique called "jailbreaking." Even after LLMs have been trained to be "safe" and aligned with human values, researchers have found that jailbreak attacks can still bypass these safeguards and elicit harmful responses.

The key insight from this research is that, while jailbreak attacks can take many different forms, the harmful responses they generate tend to be rooted in the same underlying knowledge within the LLM. Rather than trying to anticipate and block every possible jailbreak prompt, the researchers propose a more direct approach: unlearning the harmful knowledge itself.

Through extensive experiments, the researchers found that by training the LLM to forget this harmful knowledge using just a small set of raw, harmful questions (without the jailbreak prompts), they were able to significantly reduce the model's susceptibility to jailbreak attacks, even on "out-of-distribution" harmful queries wrapped in complex prompts. The resulting model was harder to jailbreak than Llama2-7B-Chat, a model fine-tuned on a far larger set of safety-alignment samples.

The researchers attribute this strong generalization to the inherent relatedness of the harmful responses, such as shared patterns, actions, and representations in the LLM. By targeting the root cause of the problem, their unlearning-based approach appears to be a more effective defense against jailbreak attacks compared to traditional techniques.

Technical Explanation

The key idea behind this research is that while different types of jailbreak attacks can generate significantly different queries, they tend to result in similar harmful responses that are rooted in the same underlying knowledge within the LLM. The researchers hypothesize that directly unlearning this harmful knowledge may be a more effective defense against jailbreak attacks than the mainstream supervised fine-tuning (SFT) approaches.

To test this hypothesis, the researchers conducted extensive experiments. They used just 20 raw, harmful questions (without any jailbreak prompts) to train Vicuna-7B to unlearn the harmful knowledge. The resulting model significantly outperformed Llama2-7B-Chat, which was fine-tuned on about 0.1M safety-alignment samples, even when Llama2-7B-Chat was given an additional safety system prompt.
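This summary does not spell out the training objective, so the following is only a toy sketch of one generic unlearning formulation, not the authors' method. It assumes a single harmful prompt with three candidate responses (harmful, refusal, unrelated) and takes gradient steps that lower the log-probability of the harmful response while raising that of the refusal. The function name `unlearning_step` and the parameters `lr` and `unlearn_weight` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unlearning_step(logits, harmful_idx, safe_idx, lr=0.5, unlearn_weight=0.3):
    """One gradient-descent step on the toy loss
    unlearn_weight * log p(harmful) - log p(safe)."""
    p = softmax(logits)
    new_logits = []
    for k, (z, pk) in enumerate(zip(logits, p)):
        # d log p(i) / d z_k = 1[k == i] - p_k
        grad_harmful = (1.0 if k == harmful_idx else 0.0) - pk
        grad_safe = (1.0 if k == safe_idx else 0.0) - pk
        grad = unlearn_weight * grad_harmful - grad_safe
        new_logits.append(z - lr * grad)
    return new_logits

# Made-up starting logits: index 0 = harmful, 1 = refusal, 2 = unrelated.
logits = [2.0, 0.0, 0.0]
p_before = softmax(logits)
for _ in range(20):
    logits = unlearning_step(logits, harmful_idx=0, safe_idx=1)
p_after = softmax(logits)

print("p(harmful) before/after:", round(p_before[0], 3), round(p_after[0], 3))
print("p(refusal) before/after:", round(p_before[1], 3), round(p_after[1], 3))
```

In a real LLM the same idea would be applied to token-level log-likelihoods, and an unbounded "push the harmful likelihood down" term usually needs a counterweight so that general capabilities are preserved; that part is omitted here.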

The researchers found that their unlearning-based approach reduced the Attack Success Rate (ASR) on "out-of-distribution" harmful questions wrapped in various complex jailbreak prompts from 82.6% to just 7.7% for the Vicuna-7B model. In contrast, the fine-tuned Llama2-7B-Chat model still had an ASR of 21.9% even with the additional safety prompt.
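To make the metric concrete, here is a minimal sketch of how an Attack Success Rate could be computed, assuming each jailbreak attempt has already been labelled as successful or not by some judge (the judging procedure is not described in this summary, and the verdict list below is made up). It also works out the relative reduction implied by the 82.6% and 7.7% figures quoted above.

```python
def attack_success_rate(judgments):
    """Fraction of attack attempts judged harmful (True = jailbreak succeeded)."""
    if not judgments:
        raise ValueError("need at least one judged response")
    return sum(judgments) / len(judgments)

# Hypothetical judge verdicts for 10 jailbreak attempts (not data from the paper).
verdicts = [True, False, False, True, False, False, False, False, False, False]
asr = attack_success_rate(verdicts)

# Relative reduction implied by the Vicuna-7B figures reported above.
before, after = 0.826, 0.077
relative_reduction = (before - after) / before

print(f"toy ASR: {asr:.1%}")
print(f"relative ASR reduction: {relative_reduction:.1%}")
```

On the reported numbers, the drop from 82.6% to 7.7% corresponds to roughly a 91% relative reduction in successful attacks.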

Further analysis revealed that the generalization ability of the unlearning-based approach stems from the inherent relatedness of the harmful responses, such as shared patterns, actions, and similarity in the learned representations within the LLM. By targeting the root cause of the problem, this approach appears to be a more effective defense against jailbreak attacks than traditional techniques.
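One simple way to probe "similarity in the learned representations" is cosine similarity between hidden-state vectors. The sketch below is illustrative only: the 4-dimensional vectors are invented, and the summary does not say which similarity measure or which layer the authors used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical "hidden states", made up for illustration only.
harmful_a = [0.9, 0.1, 0.8, 0.0]   # response to one harmful question
harmful_b = [0.8, 0.2, 0.9, 0.1]   # response to a different harmful question
benign = [0.1, 0.9, 0.0, 0.8]      # response to a harmless question

sim_harmful = cosine_similarity(harmful_a, harmful_b)
sim_mixed = cosine_similarity(harmful_a, benign)

print(f"harmful vs harmful: {sim_harmful:.3f}")
print(f"harmful vs benign:  {sim_mixed:.3f}")
```

If harmful responses to different questions really do cluster like this in a model's representation space, suppressing a handful of them would plausibly affect their neighbours too, which is the intuition behind the generalization claim.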

Critical Analysis

The researchers provide a compelling argument for their unlearning-based approach to defending against jailbreak attacks on LLMs. Their extensive experiments demonstrate the effectiveness of this method in reducing the susceptibility of LLMs to harmful queries, even on "out-of-distribution" examples.

However, several important considerations remain open:

The researchers used a relatively small set of just 20 harmful questions for the unlearning process. It's unclear how scalable this approach would be in the real world, where the space of potentially harmful knowledge is vast and continuously evolving.
The paper does not explore the potential negative side effects of unlearning this harmful knowledge, such as potential impacts on the model's overall performance or its ability to engage in benign and beneficial conversations.
The researchers acknowledge that their approach does not completely eliminate the risk of jailbreak attacks. While it significantly reduces the attack success rate, there may still be avenues for determined adversaries to find ways to bypass the defenses.
The paper does not provide details on how the unlearning process could be implemented in a practical, scalable, and efficient manner for real-world deployment of LLMs.

Overall, the research presents a promising alternative to traditional fine-tuning approaches for defending against jailbreak attacks. However, further work is needed to address the scalability, potential side effects, and practical implementation challenges of this unlearning-based approach.

Conclusion

This research proposes a novel approach to defending large language models (LLMs) against jailbreak attacks, which can bypass even safety-aligned models. The key insight is that while different jailbreak attacks can generate diverse queries, the harmful responses tend to be rooted in the same underlying knowledge within the LLM.

By directly unlearning this harmful knowledge, rather than relying on supervised fine-tuning techniques, the researchers were able to significantly reduce the susceptibility of LLMs to jailbreak attacks, even on "out-of-distribution" harmful queries.

This research highlights the potential of targeting the root cause of the problem, rather than trying to anticipate and block every possible jailbreak prompt. While the approach still has some limitations and practical challenges to address, it represents an important step towards more robust and reliable defenses against the growing threat of jailbreak attacks on powerful AI systems like LLMs.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even under the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.

7/4/2024

Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen

Jailbreaking attacks can enable Large Language Models (LLMs) to bypass the safeguard and generate harmful content. Existing jailbreaking defense methods have failed to address the fundamental issue that harmful knowledge resides within the model, leading to potential jailbreak risks for LLMs. In this paper, we propose a novel defense method called Eraser, which mainly includes three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer have the ability to answer harmful questions. The training of Eraser does not actually require the model's own harmful knowledge, and it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from the red team. The experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model. Our codes are available at https://github.com/ZeroNLP/Eraser.

7/4/2024

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, aiming to provoke unintended and unsafe behaviors from LLMs, remain a significant/leading LLM safety threat. In this paper, we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy for LLMs to generate helpful and harmless responses to user queries. Our insight in developing SafeDecoding is based on the observation that, even though probabilities of tokens representing harmful contents outweigh those representing harmless responses, safety disclaimers still appear among the top tokens after sorting tokens by probability in descending order. This allows us to mitigate jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences that are aligned with the objectives of jailbreak attacks. We perform extensive experiments on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. SafeDecoding outperforms six defense methods.

7/29/2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.

6/19/2024