Jailbreaking Black Box Large Language Models in Twenty Queries

Read original: arXiv:2310.08419 - Published 7/22/2024 by Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong

Overview

Growing interest in aligning large language models (LLMs) with human values
LLM alignment is vulnerable to adversarial "jailbreaks" that bypass safety guardrails
Identifying these vulnerabilities is crucial to understand weaknesses and prevent misuse

Plain English Explanation

Large language models (LLMs) like GPT-3 and ChatGPT are powerful AI systems that can understand and generate human-like text. However, there is growing concern that these models may not always align with human values and could be misused. One key vulnerability is the "jailbreak": a technique that coaxes an LLM into overriding its built-in safety constraints and behaving in unintended ways.

The research paper proposes a new algorithm called Prompt Automatic Iterative Refinement (PAIR) that automatically generates these jailbreaks while needing only black-box access to the target model. PAIR uses a separate "attacker" LLM to iteratively query the target LLM and refine a candidate jailbreak prompt. With this process, PAIR often finds a jailbreak in fewer than 20 queries, which the authors report is orders of magnitude more efficient than existing algorithms.

The researchers tested PAIR on a variety of open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini. They found that PAIR was often able to bypass the models' safety constraints, with success rates competitive with existing attacks.

Technical Explanation

The researchers propose the Prompt Automatic Iterative Refinement (PAIR) algorithm to automatically generate semantic jailbreaks for large language models (LLMs). PAIR uses a separate "attacker" LLM to iteratively query the target LLM and refine a jailbreak prompt, without any human intervention.

The PAIR algorithm works as follows:

1. The attacker LLM generates an initial jailbreak prompt.
2. The attacker LLM uses this prompt to query the target LLM and observes the response.
3. The attacker LLM updates the prompt based on the target LLM's response, with the goal of gradually refining the jailbreak.
4. Steps 2-3 are repeated until a successful jailbreak is found or a maximum number of iterations is reached.
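
The loop above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the authors' implementation: the names `iterative_refinement`, `attacker`, `target`, and `judge` are hypothetical, and the summary does not say how success is detected, so the success check is passed in as an assumed callable. The toy stand-ins at the bottom are harmless placeholders that let the loop run offline without calling any model.

```python
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical placeholders, not the paper's code.
History = List[Tuple[str, str]]            # past (prompt, response) pairs
Attacker = Callable[[str, History], str]   # (objective, history) -> candidate prompt
Target = Callable[[str], str]              # prompt -> response
Judge = Callable[[str, str], bool]         # (prompt, response) -> success?


def iterative_refinement(
    objective: str,
    attacker: Attacker,
    target: Target,
    judge: Judge,
    max_queries: int = 20,
) -> Tuple[Optional[str], int]:
    """Return (successful prompt or None, number of target queries used)."""
    history: History = []
    for queries_used in range(1, max_queries + 1):
        # Steps 1 and 3: the attacker proposes a prompt, conditioning on
        # earlier attempts; the history is empty on the first pass.
        prompt = attacker(objective, history)
        # Step 2: query the target once with the candidate prompt.
        response = target(prompt)
        # Assumed success check; stop as soon as it passes.
        if judge(prompt, response):
            return prompt, queries_used
        history.append((prompt, response))
    # Step 4: query budget exhausted without success.
    return None, max_queries


# Toy stand-ins so the loop runs offline with no model calls.
def toy_attacker(objective: str, history: History) -> str:
    return f"{objective} (attempt {len(history) + 1})"


def toy_target(prompt: str) -> str:
    return "ok" if "attempt 3" in prompt else "refused"


def toy_judge(prompt: str, response: str) -> bool:
    return response == "ok"


if __name__ == "__main__":
    print(iterative_refinement("obj", toy_attacker, toy_target, toy_judge))
```

Passing the three roles in as callables keeps the query budget explicit: each pass through the loop costs exactly one call to the target, which is the quantity the twenty-query figure refers to.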

The researchers evaluated PAIR on a variety of open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini. They found that PAIR often required fewer than 20 queries to produce a successful jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieved competitive jailbreaking success rates and transferability across the tested models.

Critical Analysis

The researchers acknowledge that the PAIR algorithm, while effective, could be misused to bypass the safety constraints of LLMs. They suggest that future work should focus on developing robust defenses against such jailbreaking techniques, such as adversarial tuning or SmoothLLM-style approaches.

Additionally, the researchers note that their evaluation of PAIR was limited to textual jailbreaks and that further research is needed to explore the potential for multimodal jailbreaks involving other modalities like images or audio.

Overall, the research highlights the importance of continued efforts to ensure the alignment of LLMs with human values and the need for proactive strategies to address emerging vulnerabilities.

Conclusion

The paper proposes a new algorithm called Prompt Automatic Iterative Refinement (PAIR) that can efficiently generate semantic jailbreaks for large language models (LLMs). While this research is valuable in understanding the vulnerabilities of LLMs, the potential for misuse is a significant concern. Ongoing efforts to develop robust defenses and ensure the long-term alignment of these powerful AI systems with human values are crucial for their safe and beneficial deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.

7/22/2024

Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, Kwok-Yan Lam

Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs. In this paper, to remedy these challenges, we introduce a novel jailbreaking attack framework, which is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, our method starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated our method on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs, such as GPT-3.5 turbo, GPT-4, and Gemini-Pro, our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, our method can maintain high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, our method can achieve over 78% attack success rate even with 100 tokens. Moreover, our method demonstrates transferability and is robust to state-of-the-art defenses. We will open-source our codes upon publication.

9/24/2024

EnJa: Ensemble Jailbreak on Large Language Models

Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang

As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks -- malicious prompts that can disable the safety mechanism of LLMs -- has attracted growing research attention. While alignment methods have been proposed to protect LLMs from jailbreaks, many have found that aligned LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. Existing jailbreak attacks on LLMs can be categorized into prompt-level methods which make up stories/logic to circumvent safety alignment and token-level attack methods which leverage gradient methods to find adversarial tokens. In this work, we introduce the concept of Ensemble Jailbreak and explore methods that can integrate prompt-level and token-level jailbreak into a more powerful hybrid jailbreak attack. Specifically, we propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector. We evaluate the effectiveness of EnJa on several aligned models and show that it achieves a state-of-the-art attack success rate with fewer queries and is much stronger than any individual jailbreak.

8/9/2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of jailbreaking, which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

9/2/2024