JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

2404.01318

Published 4/24/2024 by Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer and 2 others

cs.CR cs.LG

💬

Abstract

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community. Over time, we will expand and adapt the benchmark to reflect technical and methodological advances in the research community.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Large language models (LLMs) can sometimes generate harmful or unethical content when "jailbroken"
Evaluating these jailbreak attacks is challenging due to lack of standards, inconsistent reporting, and issues with reproducibility
To address these challenges, the researchers introduce JailbreakBench, an open-source benchmark with a dataset, adversarial prompt repository, evaluation framework, and performance leaderboard

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text on a wide range of topics. However, under certain conditions, these models can be prompted to produce harmful, unethical, or otherwise objectionable content. This is known as a "jailbreak" attack, where the model breaks free of its intended constraints and outputs dangerous information.

Evaluating these jailbreak attacks presents several challenges. First, there is no clear standard for how to properly test and measure these attacks. Different research teams use different methodologies, making it hard to compare results. Second, existing work reports costs and success rates in ways that are not directly comparable. Third, many studies are not reproducible, either because they don't share the adversarial prompts used, rely on closed-source code, or depend on evolving proprietary APIs.

To tackle these problems, the researchers created JailbreakBench, an open-source tool for evaluating jailbreak attacks on LLMs. JailbreakBench includes several key components:

A dataset of 100 unique "jailbreak behaviors" that the LLMs should not engage in
A repository of state-of-the-art adversarial prompts that can trigger these undesirable behaviors
A standardized evaluation framework with clear threat models, system prompts, and scoring functions
A leaderboard to track the performance of different attack and defense techniques

The researchers believe that releasing this benchmark will be a net positive for the research community, as it will help establish best practices and drive progress in making LLMs more robust against jailbreak attacks. Over time, they plan to expand and adapt the benchmark to keep pace with advancements in the field.

Technical Explanation

The paper identifies three key challenges in evaluating jailbreak attacks on large language models (LLMs):

Lack of standardization: There is no clear consensus on best practices for conducting jailbreak evaluations, leading to inconsistent methodologies across different studies.
Incomparable metrics: Existing works report costs and success rates in ways that cannot be directly compared, making it difficult to assess the relative performance of different attack and defense approaches.
Reproducibility issues: Many studies are not reproducible, either because they do not share the adversarial prompts used, rely on closed-source code, or depend on evolving proprietary APIs.

To address these challenges, the authors introduce JailbreakBench, an open-source benchmark with the following components:

JBB-Behaviors: A dataset of 100 unique "jailbreak behaviors" that LLMs should not engage in, such as generating hate speech or explicit content.
Jailbreak artifacts: An evolving repository of state-of-the-art adversarial prompts that can trigger these undesirable behaviors in LLMs.
Standardized evaluation framework: A clear threat model, system prompts, chat templates, and scoring functions to enable consistent and comparable evaluations.
Performance leaderboard: A system that tracks the performance of various attack and defense techniques against the benchmark.

The authors have carefully considered the ethical implications of releasing this benchmark and believe it will be a net positive for the research community. Over time, they plan to expand and adapt the benchmark to reflect technical and methodological advancements in the field.

Critical Analysis

The researchers have identified important challenges in evaluating jailbreak attacks on large language models and have taken a valuable first step in addressing them through the JailbreakBench framework.

One potential limitation of the current work is the scope of the benchmark, which focuses on 100 specific "jailbreak behaviors." While this provides a solid starting point, the researchers acknowledge that the benchmark will need to evolve to keep pace with emerging threats and attack vectors. Additionally, the choice of which behaviors to include in the dataset could be subjective, and there may be difficulty in defining clear boundaries for what constitutes "unethical" or "objectionable" content.

Another area for further exploration is the threat model and scoring functions used in the evaluation framework. The researchers have made reasonable choices, but there may be opportunities to refine these components to better capture the nuances of real-world jailbreak attacks and defenses.

Finally, the long-term maintenance and evolution of the JailbreakBench repository will be crucial to its success. The researchers will need to ensure that the benchmark remains relevant and up-to-date, while also addressing potential issues of bias, gaming, or other unintended consequences that may arise as the tool is more widely adopted.

Conclusion

The JailbreakBench framework represents an important step forward in establishing standards and best practices for evaluating jailbreak attacks on large language models. By providing a well-designed benchmark, the researchers aim to drive progress in making these powerful AI systems more robust and aligned with ethical principles.

As the field of AI safety continues to evolve, tools like JailbreakBench will be increasingly valuable in helping researchers, developers, and the general public understand the risks and mitigation strategies for harmful language model outputs. The open-source and collaborative nature of the benchmark also holds promise for fostering a more coordinated and effective response to the challenge of jailbreak attacks.

Related Papers

JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2, 000 malicious queries that is also proposed in this paper, we generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive dataset includes 28, 000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.

4/19/2024

cs.CR cs.AI cs.CL

💬

Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik

Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.

4/15/2024

cs.CL cs.AI cs.CR cs.LG

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

4/16/2024

cs.CR cs.CL cs.HC

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

4/15/2024

cs.CR cs.AI cs.CL