Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

2406.09324

Published 6/14/2024 by Zhao Xu, Fan Liu, Hao Liu

🌀

Abstract

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we evaluate the impact of various attack settings on LLM performance and provide a baseline benchmark for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 320 experiments with about 50,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/Bag_of_Tricks_for_LLM_Jailbreaking.

Create account to get full access

Overview

Large Language Models (LLMs) have demonstrated impressive capabilities, but they are vulnerable to jailbreak attacks that can manipulate them to produce harmful outputs.
Recent research has categorized jailbreak attacks into token-level and prompt-level attacks, but previous work has overlooked key factors and focused mainly on LLM vulnerabilities rather than defense-enhanced models.
This paper aims to address these issues by evaluating the impact of various attack settings on LLM performance and providing a baseline benchmark for jailbreak attacks.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can perform complex tasks, like generating human-like text, with minimal guidance. However, these models can also be manipulated to produce harmful or unintended outputs through jailbreak attacks.

Researchers have identified two main types of jailbreak attacks: token-level attacks, which target individual words or phrases, and prompt-level attacks, which manipulate the overall instructions given to the model. While previous studies have focused on these vulnerabilities, they haven't thoroughly explored the diverse factors that can influence the success of jailbreak attacks.

This paper aims to fill that gap by evaluating the impact of various attack settings on LLM performance. The researchers also provide a standardized benchmark to help other researchers and developers assess the effectiveness of defenses against these types of attacks.

Technical Explanation

The researchers conducted a comprehensive study to understand the key factors that influence the success of jailbreak attacks on LLMs. They evaluated eight different factors, including the target-level (e.g., the specific task or dataset) and attack-level (e.g., the type of attack) perspectives.

To establish a baseline, the team carried out seven representative jailbreak attacks on six different defense methods across two widely used datasets. This encompassed approximately 320 experiments, totaling around 50,000 GPU hours on an A800-80G system.

The experimental results highlight the need for a standardized benchmarking framework to assess the robustness of defense-enhanced LLMs against jailbreak attacks. The researchers have made their code available at https://github.com/usail-hkust/Bag_of_Tricks_for_LLM_Jailbreaking to encourage the adoption of this evaluation framework.

Critical Analysis

The paper provides a comprehensive and systematic approach to understanding the factors that influence the success of jailbreak attacks on LLMs. However, the researchers acknowledge that their study is not exhaustive, and there may be additional factors or attack vectors that were not considered.

Additionally, the paper focuses on evaluating the performance of LLMs under various attack settings, but it does not delve deeply into the underlying mechanisms or vulnerabilities that enable these attacks. Further research may be needed to develop more robust and resilient LLM architectures that can withstand a wider range of jailbreak attacks.

It's also worth noting that the researchers used a specific hardware configuration (A800-80G) for their experiments, which may not be representative of all LLM deployment scenarios. The performance and effectiveness of the attacks and defenses may vary depending on the hardware and software environment.

Conclusion

This paper presents a valuable contribution to the field of LLM security by systematically evaluating the impact of jailbreak attacks and establishing a standardized benchmark for assessing the robustness of defense-enhanced LLMs. The findings highlight the need for continued research and development in this area to ensure the safe and responsible deployment of these powerful AI systems.

By providing a comprehensive analysis and an open-source benchmarking framework, the researchers have laid the groundwork for further advancements in LLM security and paved the way for the development of more robust and reliable AI-powered applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

6/18/2024

cs.CR cs.LG

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

4/16/2024

cs.CR cs.CL cs.HC

JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2, 000 malicious queries that is also proposed in this paper, we generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive dataset includes 28, 000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.

4/19/2024

cs.CR cs.AI cs.CL