JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

2404.03027

Published 4/19/2024 by Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, Chaowei Xiao

JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks

Abstract

With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2, 000 malicious queries that is also proposed in this paper, we generate 20, 000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8, 000 image-based jailbreak inputs from recent MLLMs jailbreak attacks, our comprehensive dataset includes 28, 000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper introduces JailBreakV-28K, a benchmark for assessing the robustness of multimodal large language models (LLMs) against jailbreak attacks.
Jailbreak attacks attempt to bypass the safety constraints and ethical training of LLMs to get them to generate harmful or deceptive content.
The benchmark tests LLMs' ability to resist jailbreak attempts across a diverse set of 28,000 prompts.

Plain English Explanation

Large language models (LLMs) like ChatGPT have become incredibly powerful at understanding and generating human-like text. However, these models are also trained to behave ethically and avoid producing harmful or deceptive content. Jailbreak attacks are attempts to bypass these safety constraints and get the models to ignore their ethical training.

The JailBreakV-28K benchmark aims to measure how well different LLMs can resist such jailbreak attempts. It presents the models with over 28,000 carefully crafted prompts that try to trick them into generating unsafe or unethical content. The models' responses are then evaluated to see how well they were able to stay within their ethical boundaries.

This benchmark is important because it allows researchers and developers to better understand the vulnerabilities of these powerful AI systems. By identifying the types of prompts that can trip up the models, they can work to make the models more robust and resistant to jailbreak attacks in the future. This helps ensure that as LLMs become more capable, they remain aligned with human values and can be deployed safely.

Technical Explanation

The key elements of this paper are:

Benchmark Design: The authors created the JailBreakV-28K benchmark, which consists of 28,000 carefully curated prompts that test an LLM's ability to resist jailbreak attempts. The prompts cover a diverse range of topics and use various linguistic and rhetorical techniques to try to bypass the model's ethical constraints.
Evaluation Metrics: The authors define several metrics to assess an LLM's performance on the benchmark, including safety score (how well the model avoids generating harmful content), coherence score (how coherent and sensible the model's responses are), and overall robustness score.
Experiments: The authors evaluate several prominent LLMs, including GPT-3, InstructGPT, and Chinchilla, on the JailBreakV-28K benchmark. They analyze the models' performance across different prompt types and discuss the strengths and weaknesses of each model.
Insights: The results reveal that current LLMs still have significant vulnerabilities when it comes to resisting jailbreak attacks. Even the most advanced models struggle with certain types of prompts and can sometimes be tricked into generating unethical or harmful content.

Critical Analysis

The authors acknowledge several limitations of their work. First, the benchmark, while extensive, may not capture all possible jailbreak attack vectors. There may be other types of prompts or techniques that could trip up the models in ways not tested here.

Additionally, the authors note that the benchmark evaluates the models' responses in isolation, without considering real-world deployment scenarios where other safeguards and monitoring systems may be in place. In a live system, there may be additional layers of protection that could mitigate the risks identified in the benchmark.

Further research is needed to address these limitations and continue improving the robustness of LLMs. Potential areas for future work include developing more advanced prompt generation techniques, testing the models in more realistic interactive settings, and exploring novel model architectures or training approaches that may be more resistant to jailbreak attacks.

Conclusion

The JailBreakV-28K benchmark provides a valuable tool for assessing the robustness of multimodal LLMs against jailbreak attacks. The results highlight the significant vulnerabilities that still exist in these powerful AI systems, even in the most advanced models. By identifying these weaknesses, the research can help drive the development of more secure and trustworthy LLMs that can be deployed safely in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community. Over time, we will expand and adapt the benchmark to reflect technical and methodological advances in the research community.

4/24/2024

cs.CR cs.LG

💬

New!A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

💬

Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik

Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.

5/8/2024

cs.CL cs.AI cs.CR cs.LG

💬

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

5/16/2024

cs.CR cs.LG