Intention Analysis Makes LLMs A Good Jailbreak Defender

2401.06561

Published 4/30/2024 by Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao

Intention Analysis Makes LLMs A Good Jailbreak Defender

Abstract

Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($mathbb{IA}$). The principle behind this is to trigger LLMs' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. Notably, $mathbb{IA}$ is an inference-only method, thus could enhance the safety of LLMs without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across ChatGLM, LLaMA2, Vicuna, MPT, DeepSeek, and GPT-3.5 show that $mathbb{IA}$ could consistently and significantly reduce the harmfulness in responses (averagely -53.1% attack success rate) and maintain the general helpfulness. Encouragingly, with the help of our $mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. Further analyses present some insights into how our method works. To facilitate reproducibility, we release our code and scripts at: https://github.com/alphadl/SafeLLM_with_IntentionAnalysis.

Create account to get full access

Overview

This paper explores the use of "intention analysis prompting" to make large language models (LLMs) more robust against "jailbreak" attacks, which aim to bypass the models' safety constraints.
The researchers propose a method that analyzes the user's underlying intentions behind prompts and uses that analysis to determine whether the prompt is likely to result in harmful or unsafe outputs.
The paper presents experimental results showing that this approach can effectively defend against several types of jailbreak attacks, including alignment-breaking adversarial attacks, hidden malicious goals, and visual jailbreak attacks.

Plain English Explanation

The paper is about making large language models, which are powerful AI systems that can generate human-like text, more secure and reliable. These models are often used in helpful applications like answering questions or summarizing information. However, there is a risk that someone could try to "trick" the model into producing harmful or unsafe outputs, a type of attack known as a "jailbreak."

To address this, the researchers developed a technique called "intention analysis prompting." The idea is to analyze the user's underlying intentions when they provide a prompt to the language model. By understanding the user's true goal, the system can better determine whether the prompt is likely to lead to something unsafe or undesirable.

For example, if a user asks the language model to "write a terrorist manifesto," the intention analysis would recognize that this is likely a harmful request, even though the prompt itself doesn't directly say anything dangerous. The system could then refuse to generate that content.

The paper shows through experiments that this approach is effective at defending against several different types of jailbreak attacks, including adversarial attacks that try to bypass the model's safety constraints, hidden malicious goals embedded in seemingly benign prompts, and even visual jailbreak attacks.

Overall, this research is an important step towards making large language models more reliable and trustworthy, which is crucial as these models become more widely used in real-world applications.

Technical Explanation

The core of this paper's approach is an "intention analysis prompting" module that is integrated into the language model's architecture. This module examines the user's prompt and attempts to infer their underlying intentions or goals.

To do this, the researchers trained a separate neural network model on a large dataset of human-written prompts and the corresponding user intentions. This allowed the intention analysis module to learn patterns that associate certain prompts with specific types of user goals, such as seeking information, expressing creativity, or attempting to bypass the model's safety constraints.

When a user provides a new prompt, the intention analysis module classifies the prompt according to these learned intention categories. Based on the inferred intention, the language model can then decide whether to generate the requested output or refuse the prompt if it appears to have a harmful or unsafe intent.

The paper evaluates this approach on several different jailbreak attack scenarios, including alignment-breaking adversarial attacks, hidden malicious goals, and visual jailbreak attacks. The results show that the intention analysis prompting system is effective at detecting and blocking these types of attacks, significantly outperforming baseline language models without this additional safety mechanism.

Critical Analysis

The paper presents a promising approach to making large language models more robust against jailbreak attacks. The intention analysis prompting technique seems well-designed and the experimental results are compelling. However, there are a few potential limitations and areas for further research:

The intention analysis model's performance is heavily dependent on the quality and coverage of the training data. If the dataset of human-written prompts does not capture the full range of possible malicious intents, the model may miss certain types of jailbreak attacks.
The paper does not address the potential for "adversarial prompts" that could fool the intention analysis module itself. An attacker may be able to craft prompts that are designed to bypass the intention classifier, even if the underlying intent is harmful.
The authors acknowledge that their approach may introduce some overhead and latency compared to a baseline language model, as the additional intention analysis step is required. Further research is needed to optimize the efficiency of this approach.
It would be valuable to see how this technique performs in real-world, large-scale deployments of language models, where the diversity and complexity of user prompts may pose additional challenges.

Despite these potential limitations, this research represents an important step forward in rethinking how to evaluate the safety and robustness of language models. The intention analysis prompting approach shows promise as a way to make large language models more resilient against jailbreak attacks and increase trust in their outputs.

Conclusion

This paper introduces a novel "intention analysis prompting" technique that enhances the safety and reliability of large language models by detecting and blocking potential jailbreak attacks. Through a series of experiments, the researchers demonstrate the effectiveness of this approach in defending against various types of adversarial and malicious prompts.

While the proposed method has some limitations that warrant further investigation, this research represents an important advancement in the field of language model safety and robustness. As these powerful AI systems become more widely deployed, ensuring their trustworthiness and alignment with human values is crucial. The intention analysis prompting technique shown in this paper is a promising step towards that goal.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang

While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the intrinsic conflict between the goals of being helpful and ensuring safety. Accordingly, we propose to integrate goal prioritization at both training and inference stages to counteract. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking from 66.4% to 3.6% for ChatGPT. And integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks, both because of their stronger ability in instruction following. Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety. Our code is available at url{https://github.com/thu-coai/JailbreakDefense_GoalPriority}.

6/13/2024

cs.CL

🤷

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang

To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting this identified flaw by obfuscating the true intentions behind user prompts.This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: Obscure Intention and Create Ambiguity, which manipulate query complexity and ambiguity to evade malicious intent detection effectively. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users, achieved a remarkable success rate of 83.65%. We also extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of our findings on enhancing 'Red Team' strategies against LLM content security frameworks.

5/8/2024

cs.CR cs.AI

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang

Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. We offer theoretical insights to elucidate how a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety of LLMs.

5/28/2024

cs.LG cs.AI cs.CL cs.CR