Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

arXiv:2405.20413

Published 6/3/2024 by Haibo Jin, Andy Zhou, Joe D. Menke, Haohan Wang

Abstract

Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success ($\sim\times 19.88$) and lower filtered-out rates ($\sim\times 1/6$) than baselines.

Overview

This paper investigates techniques for "jailbreaking" or bypassing content moderation safeguards in large language models (LLMs).
The researchers explore how to generate text that evades detection by LLM moderation systems using "cipher characters" - obscure Unicode characters that can be used to slip past content filters.
They evaluate the method on four LLMs and report higher jailbreak success and lower filtered-out rates than baseline attacks.
The paper raises important questions about the robustness and reliability of current LLM moderation approaches in the face of increasingly sophisticated evasion tactics.

Plain English Explanation

The paper describes a way to trick large AI language models, like GPT-3, into generating harmful or deceptive content that can bypass the models' content moderation safeguards. The researchers found that by using obscure Unicode characters, called "cipher characters," they could create text that looks innocuous to the model's filters but actually contains dangerous or misleading information.

This work demonstrates how current content moderation systems for large language models can be vulnerable to these kinds of evasion tactics. Even if an AI model is designed to block the generation of harmful content, determined users may be able to find clever ways around those protections. This raises concerns about the reliability and robustness of AI systems, especially as they become more advanced and widely deployed.

The paper highlights the need for more rigorous and adaptable content moderation approaches that can keep pace with increasingly sophisticated attempts to bypass them. As large language models become more powerful and ubiquitous, ensuring they are used safely and responsibly will be a critical challenge for the AI research community and wider society.

Technical Explanation

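The paper investigates "jailbreaking", that is, bypassing the content moderation safeguards of large language models (LLMs). According to the abstract, the proposed method, JAM, combines two components: jailbreak prefixes that target input-level filters, and "cipher characters" that target output-level filters. The cipher characters are generated with a fine-tuned shadow model that is functionally equivalent to the guardrail model.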

Through a series of experiments, the authors demonstrate the effectiveness of these cipher character techniques in generating text that evades detection by LLM moderation systems. They show how this approach can be used to produce content that violates the models' safety constraints, such as hate speech, disinformation, and other harmful outputs.
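The abstract reports results as ratios against baselines (jailbreak success and filtered-out rate). The sketch below shows one way such ratios could be computed from raw evaluation counts. The definitions of both rates, the `EvalCounts` structure, and all numbers are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EvalCounts:
    total: int       # prompts attempted
    jailbroken: int  # attempts judged to be successful jailbreaks
    filtered: int    # attempts blocked by the moderation guardrail


def rates(counts: EvalCounts) -> tuple[float, float]:
    """Return (jailbreak success rate, filtered-out rate) as fractions of total."""
    if counts.total <= 0:
        raise ValueError("total must be positive")
    return counts.jailbroken / counts.total, counts.filtered / counts.total


def relative_to_baseline(method: EvalCounts, baseline: EvalCounts) -> tuple[float, float]:
    """Return method/baseline ratios for success rate and filtered-out rate."""
    m_success, m_filtered = rates(method)
    b_success, b_filtered = rates(baseline)
    if b_success == 0 or b_filtered == 0:
        raise ValueError("baseline rates must be non-zero to form a ratio")
    return m_success / b_success, m_filtered / b_filtered


# Made-up counts for illustration only; they are not results from the paper.
method = EvalCounts(total=160, jailbroken=80, filtered=10)
baseline = EvalCounts(total=160, jailbroken=8, filtered=60)
success_ratio, filtered_ratio = relative_to_baseline(method, baseline)
print(f"success x{success_ratio:.2f}, filtered-out x{filtered_ratio:.2f}")
```

Read this way, a success ratio above 1 together with a filtered-out ratio below 1 corresponds to the pattern the abstract describes: more successful jailbreaks and fewer attempts stopped by the guardrail.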

The paper provides a detailed technical analysis of the underlying mechanisms behind these jailbreaking attacks. It examines how the unique properties of Unicode characters, combined with the limitations of current LLM filtering approaches, can be leveraged to bypass moderation. The researchers also discuss potential mitigations and the broader implications for the robustness and reliability of LLM content moderation.
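On the mitigation side, one commonly used defensive step is to normalize text before it reaches a moderation filter, so that unusual characters cannot hide content from it. The sketch below is a minimal illustration of that idea and is not the paper's proposed defense. It assumes, as this summary describes, that the evasion relies on obscure or non-printing Unicode characters; the choice of suspicious categories and the function names are hypothetical.

```python
import unicodedata

# Unicode general categories treated as suspicious in this sketch (an assumption):
# Cf = format characters (e.g. zero-width spaces), Co = private use, Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Cf", "Co", "Cn"}


def normalize_for_moderation(text: str) -> str:
    """Apply NFKC normalization, then drop characters in suspicious categories."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch) not in SUSPICIOUS_CATEGORIES
    )


def suspicious_ratio(text: str) -> float:
    """Fraction of characters in `text` that fall in a suspicious category."""
    if not text:
        return 0.0
    flagged = sum(unicodedata.category(ch) in SUSPICIOUS_CATEGORIES for ch in text)
    return flagged / len(text)
```

A moderation pipeline could pass `normalize_for_moderation(output)` to its filter and treat a high `suspicious_ratio(output)` as a signal for closer review. Normalization alone would not address cipher characters that consist of ordinary printable symbols, so this covers only part of the problem.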

Critical Analysis

The research presented in this paper highlights the significant challenges in developing robust content moderation systems for large language models. While the authors demonstrate the effectiveness of their cipher character techniques, it is concerning that such evasion tactics are possible in the first place.

One key limitation of the work is that it only focuses on text-based attacks, whereas many real-world LLM applications also involve multimodal content (e.g., text, images, and audio). It would be valuable to see if similar jailbreaking approaches can be extended to these more complex scenarios.

Additionally, the paper does not provide a comprehensive analysis of the potential societal harms that could arise from the unconstrained generation of harmful content using these techniques. Further research is needed to understand the full implications and develop more robust mitigations.

Overall, this work serves as an important wake-up call for the AI research community. It underscores the need for more rigorous and adaptable content moderation approaches that can keep pace with increasingly sophisticated attempts to bypass them. Addressing these challenges will be crucial as large language models become more powerful and ubiquitous.

Conclusion

This paper presents a concerning exploration of techniques for "jailbreaking" large language models and bypassing their content moderation safeguards. By leveraging obscure Unicode characters, the researchers demonstrate how text can be generated that evades detection by LLM filtering systems, potentially enabling the production of harmful or deceptive content.

The findings of this study raise significant questions about the reliability and robustness of current LLM moderation approaches. As these models become more advanced and widely deployed, ensuring their safe and responsible use will be a critical challenge. This work highlights the need for more rigorous and adaptable content moderation solutions that can keep pace with evolving evasion tactics.

Ultimately, this research serves as a crucial wake-up call for the AI community. It underscores the importance of developing more robust and comprehensive safeguards to protect against the misuse of powerful language models. Addressing these challenges will be essential as we strive to harness the potential of large language models while mitigating their risks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

5/16/2024

cs.CR cs.LG

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, Eric Wong

Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.

6/18/2024

cs.CR cs.LG

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMs) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik

Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.

5/8/2024

cs.CL cs.AI cs.CR cs.LG