Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

2311.09096

Published 6/13/2024 by Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang

💬

Abstract

While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the intrinsic conflict between the goals of being helpful and ensuring safety. Accordingly, we propose to integrate goal prioritization at both training and inference stages to counteract. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking from 66.4% to 3.6% for ChatGPT. And integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks, both because of their stronger ability in instruction following. Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety. Our code is available at url{https://github.com/thu-coai/JailbreakDefense_GoalPriority}.

Create account to get full access

Overview

This paper discusses the issue of jailbreaking attacks on large language models (LLMs) and proposes a novel defense strategy.
Jailbreaking attacks aim to exploit weaknesses in LLMs to bypass their intended safety constraints and get them to produce harmful or undesirable outputs.
The researchers identify a key factor contributing to the success of these attacks: the inherent tension between an LLM's goal of being helpful and its goal of ensuring safety.
To address this, the researchers introduce a "goal prioritization" approach that can be integrated into both the training and inference stages of LLM development.

Plain English Explanation

The paper focuses on the problem of jailbreaking attacks on large language models (LLMs) like ChatGPT and Llama2-13B. These attacks try to bypass the safety measures built into these models to get them to produce harmful or undesirable outputs.

The researchers found that a key reason these attacks are successful is the inherent conflict between an LLM's goal of being helpful to users and its goal of staying safe and avoiding harmful outputs. This makes it difficult for the models to consistently prioritize safety over helpfulness.

To address this, the researchers propose a new approach called "goal prioritization" that can be incorporated into both the training and inference (output generation) stages of LLM development. This helps the models better balance their competing goals and substantially reduces the success rate of jailbreaking attacks, from around 66% down to just 3.6% for ChatGPT.

Interestingly, the researchers also found that stronger LLMs actually have a greater capacity to defend against these attacks, thanks to their stronger ability to follow instructions. This suggests that as LLMs become more capable, they may also become better equipped to protect themselves from malicious attempts to misuse them.

Technical Explanation

The researchers conducted a series of experiments to assess the effectiveness of their goal prioritization approach for defending against jailbreaking attacks on LLMs.

During the inference stage, they implemented a mechanism that allowed the model to dynamically adjust the prioritization of its safety and helpfulness goals based on the input prompt. This substantially reduced the attack success rate from 66.4% to just 3.6% for ChatGPT.

They also integrated goal prioritization into the model training process, which reduced the attack success rate from 71.0% to 6.6% for the Llama2-13B model. Remarkably, even in scenarios where no jailbreaking samples were included during training, this approach still halved the attack success rate.

The researchers attribute these improvements to the models' enhanced ability to balance their helpfulness and safety goals, which makes it much harder for attackers to bypass the intended safety constraints.

Critical Analysis

The researchers acknowledge several caveats and limitations in their work. For example, they note that their goal prioritization approach may not be effective against more sophisticated jailbreaking techniques that can bypass the models' internal goal management systems.

Additionally, the researchers suggest that further research is needed to understand the relationship between an LLM's capability and its capacity for self-defense. While they found that stronger models can better defend against jailbreaking, it's unclear if this trend will continue as models become even more powerful.

It's also worth considering whether the goal prioritization approach could have unintended consequences, such as making the models overly cautious or risk-averse in their interactions with users. Careful evaluation of the real-world impact of this defense strategy will be crucial.

Conclusion

This paper makes a valuable contribution to the understanding of jailbreaking attacks on LLMs and proposes a promising defense strategy based on goal prioritization. By addressing the inherent tension between an LLM's helpfulness and safety goals, the researchers have developed an approach that can significantly reduce the success of these attacks.

As LLMs continue to grow in capability and become more widely deployed, the importance of robust safety measures like the one described in this paper will only increase. The researchers' findings suggest that as LLMs become more powerful, they may also become better equipped to defend themselves against malicious attempts to misuse them, which could have important implications for the future development and deployment of these transformative technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner

Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into four major categories: optimization-based attacks such as Greedy Coordinate Gradient (GCG), jailbreak template-based attacks such as Do-Anything-Now, advanced indirect attacks like DrAttack, and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delay to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM defense instance to concurrently protect the target LLM instance in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs (both target and defense LLMs) have the capability to identify harmful prompts or intentions in user queries, which we empirically validate using the commonly used GPT-3.5/4 models across all major jailbreak attacks. Our measurements show that SelfDefend enables GPT-3.5 to suppress the attack success rate (ASR) by 8.97-95.74% (average: 60%) and GPT-4 by even 36.36-100% (average: 83%), while incurring negligible effects on normal queries. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. These models outperform four SOTA defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. We also empirically show that the tuned models are robust to targeted GCG and prompt injection attacks.

6/11/2024

cs.CR cs.AI

💬

Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik

Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.

5/8/2024

cs.CL cs.AI cs.CR cs.LG

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

4/15/2024

cs.CR cs.AI cs.CL