SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

2406.05498

Published 6/11/2024 by Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel

cs.CR cs.AI

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Abstract

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into four major categories: optimization-based attacks such as Greedy Coordinate Gradient (GCG), jailbreak template-based attacks such as Do-Anything-Now, advanced indirect attacks like DrAttack, and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delay to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM defense instance to concurrently protect the target LLM instance in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs (both target and defense LLMs) have the capability to identify harmful prompts or intentions in user queries, which we empirically validate using the commonly used GPT-3.5/4 models across all major jailbreak attacks. Our measurements show that SelfDefend enables GPT-3.5 to suppress the attack success rate (ASR) by 8.97-95.74% (average: 60%) and GPT-4 by even 36.36-100% (average: 83%), while incurring negligible effects on normal queries. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. These models outperform four SOTA defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. We also empirically show that the tuned models are robust to targeted GCG and prompt injection attacks.

Create account to get full access

Overview

This paper proposes a practical method called "SelfDefend" for large language models (LLMs) to defend themselves against jailbreaking attacks, which aim to bypass the safety constraints of the models.
The authors demonstrate that LLMs can effectively detect and resist jailbreaking attempts through a combination of self-examination and adversarial training.
The research aims to enhance the robustness and security of advanced AI systems, addressing an important challenge in the field of AI safety and alignment.

Plain English Explanation

The paper introduces a technique called "SelfDefend" that allows large language models (LLMs) to protect themselves against a type of attack known as "jailbreaking." Jailbreaking refers to bypassing the safety restrictions and intended behaviors of an LLM, often with the goal of making the model behave in undesirable or unsafe ways.

The SelfDefend approach allows the LLM to constantly monitor its own responses and detect when it is being asked to do something unsafe or unintended. When the model identifies a potential jailbreaking attempt, it can then take steps to resist and defend itself, such as refusing to carry out the harmful request.

This is an important advancement because as LLMs become more capable and widely used, it is crucial that they can protect themselves from being misused or manipulated in ways that could lead to unintended or dangerous consequences. The SelfDefend technique aims to make LLMs more robust and secure, helping to ensure they remain aligned with their intended purpose and safety constraints.

Technical Explanation

The paper introduces a novel defense mechanism called "SelfDefend" that enables large language models (LLMs) to detect and resist jailbreaking attacks. Jailbreaking refers to techniques that bypass the intended safety constraints and behaviors of an LLM, often with the goal of making the model produce harmful or undesirable outputs.

The SelfDefend approach works by training the LLM to constantly monitor its own responses and detect when it is being asked to do something unsafe or unintended. This is achieved through a combination of self-examination and adversarial training, where the model is exposed to a variety of simulated jailbreaking attempts during the training process.

When the model identifies a potential jailbreaking attack, it can then take appropriate defensive actions, such as refusing to carry out the harmful request or alerting the system administrators. The authors demonstrate the effectiveness of this approach through extensive experiments, showing that SelfDefend can significantly improve an LLM's robustness against jailbreaking attacks without compromising its core functionality.

The research builds on previous work on LLM security and safety, and aims to enhance the reliability and trustworthiness of advanced AI systems by providing them with the ability to self-defend against malicious attempts to misuse or manipulate them.

Critical Analysis

The SelfDefend approach presented in the paper is a promising step towards improving the security and robustness of large language models (LLMs). The authors have demonstrated the effectiveness of this technique in detecting and resisting jailbreaking attacks through rigorous experimentation.

One potential limitation of the approach is that it may not be able to defend against all possible jailbreaking attempts, especially those that are highly sophisticated or involve novel attack vectors. The authors acknowledge this and suggest that the SelfDefend mechanism should be continuously updated and refined as new threats emerge.

Additionally, the implementation of SelfDefend may introduce some computational overhead or performance impact on the LLM, which could be a concern for certain real-world applications where efficiency is crucial. The authors mention that they have attempted to minimize these effects, but further research may be needed to fully assess the trade-offs.

Another area for further investigation is the potential for unintended consequences or side effects of the SelfDefend mechanism. While the goal is to enhance the safety and alignment of LLMs, it is possible that the self-defense system could be exploited or misused in unanticipated ways. Careful monitoring and ongoing evaluation of the system's behavior will be essential to address this concern.

Overall, the SelfDefend approach represents an important contribution to the field of AI safety and alignment, and the authors have provided a solid foundation for further research and development in this critical area.

Conclusion

The paper presents a practical method called "SelfDefend" that enables large language models (LLMs) to effectively defend themselves against jailbreaking attacks. The authors demonstrate that by training LLMs to constantly monitor their own responses and detect potential threats, the models can take appropriate defensive actions to resist harmful or unintended requests.

This research is a significant step towards enhancing the robustness and security of advanced AI systems, addressing a crucial challenge in the field of AI safety and alignment. As LLMs continue to grow in capability and become increasingly integrated into various applications, the ability of these models to protect themselves from misuse or manipulation will be paramount.

While the SelfDefend approach shows promise, the authors acknowledge the need for ongoing refinement and evaluation to address potential limitations and unintended consequences. Nonetheless, this work represents an important advancement in the quest to develop AI systems that are reliable, trustworthy, and aligned with human values and safety considerations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

💬

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at url{https://github.com/arobey1/smooth-llm}.

6/17/2024

cs.LG cs.AI stat.ML

💬

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang

While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the intrinsic conflict between the goals of being helpful and ensuring safety. Accordingly, we propose to integrate goal prioritization at both training and inference stages to counteract. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking from 66.4% to 3.6% for ChatGPT. And integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks, both because of their stronger ability in instruction following. Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety. Our code is available at url{https://github.com/thu-coai/JailbreakDefense_GoalPriority}.

6/13/2024

cs.CL

Defending LLMs against Jailbreaking Attacks via Backtranslation

Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh

Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by ``backtranslation''. Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits on its effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, in the cases that are hard for the baselines, and our defense also has little impact on the generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at url{https://github.com/YihanWang617/llm-jailbreaking-defense}, and the code for reproducing our experiments is available at url{https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation}.

6/10/2024

cs.CL cs.AI