GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models

Read original: arXiv:2402.03299 - Published 6/3/2024 by Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, Haohan Wang


Overview

Researchers propose a novel system called GUARD (Guideline Upholding through Adaptive Role-play Diagnostics) to proactively test large language models (LLMs) for safety and mitigate the risk of harmful "jailbreak" responses.
GUARD uses a role-playing approach where different user LLMs collaborate to generate new jailbreaks, leveraging a knowledge graph of existing jailbreak characteristics.
The system also includes an automated setting to generate jailbreaks that follow government-issued guidelines, to test if LLMs comply accordingly.
GUARD has been validated on several state-of-the-art LLMs and extended to vision-language models, demonstrating its versatility.

Plain English Explanation

Large language models (LLMs) like ChatGPT have shown impressive capabilities, but they can also be susceptible to "jailbreaks" - techniques that bypass the models' safety filters and induce them to generate unethical or harmful responses. To address this, the researchers propose a novel system called GUARD that proactively tests LLMs for such safety issues.

GUARD uses a role-playing approach, where different user LLMs collaborate to come up with new jailbreaks. The researchers have also collected existing jailbreaks, split them into individual characteristics, and organized those characteristics into a knowledge graph, making them easier to retrieve and reuse when generating more jailbreaks.

GUARD also includes a setting that takes government-issued guidelines as its starting point and automatically generates jailbreaks from them, to test whether the LLMs actually adhere to those guidelines. This is an important step in ensuring the models behave responsibly and in line with ethical standards.

The researchers have validated GUARD on several state-of-the-art LLMs, including Vicuna-13B, LongChat-7B, and Llama-2-7B, as well as the widely used ChatGPT. They've also extended GUARD to vision-language models, showcasing its versatility across different AI modalities.

Technical Explanation

The researchers propose GUARD, a system that leverages a role-playing approach to proactively test large language models (LLMs) for safety and mitigate the risk of harmful "jailbreak" responses. GUARD assigns four different roles to user LLMs, which then collaborate to generate new jailbreaks.
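The collaboration between roles can be pictured as a small loop in which each role reads and updates a shared draft. The sketch below is only an illustration under stated assumptions: this summary does not name the four roles, so `translator`, `generator`, `evaluator`, and `optimizer` are illustrative labels, the `Draft` fields and the 0-10 score are hypothetical, and every role is a stub that returns placeholder text instead of calling a model.

```python
from dataclasses import dataclass
from typing import Callable, List

THRESHOLD = 8  # hypothetical pass mark on a 0-10 scale


@dataclass
class Draft:
    """Shared state handed from one role to the next."""
    guideline: str
    question: str = ""
    scenario: str = ""
    score: int = 0
    revisions: int = 0


Role = Callable[[Draft], Draft]


def translator(d: Draft) -> Draft:
    # Stub: turn the guideline into a test question (placeholder text only).
    if not d.question:
        d.question = f"<test question for guideline: {d.guideline}>"
    return d


def generator(d: Draft) -> Draft:
    # Stub: wrap the question in a role-play scenario (placeholder text only).
    if not d.scenario:
        d.scenario = f"<scenario wrapping {d.question}>"
    return d


def evaluator(d: Draft) -> Draft:
    # Stub score that rises with each revision; a real evaluator would
    # inspect the target model's reply instead.
    d.score = min(10, 4 * (d.revisions + 1))
    return d


def optimizer(d: Draft) -> Draft:
    # Stub: revise the scenario only while the score is below the pass mark.
    if d.score < THRESHOLD:
        d.revisions += 1
        d.scenario = f"<revision {d.revisions} of scenario for {d.question}>"
    return d


ROLES: List[Role] = [translator, generator, evaluator, optimizer]


def run_round(draft: Draft, roles: List[Role], max_iters: int = 3) -> Draft:
    """Apply every role in order, repeating until the score passes or iterations run out."""
    for _ in range(max_iters):
        for role in roles:
            draft = role(draft)
        if draft.score >= THRESHOLD:
            break
    return draft


final = run_round(Draft(guideline="example guideline"), ROLES)
print(final.score, final.revisions)
```

Keeping each role as a plain function over a shared `Draft` makes it easy to swap a stub for a real model call without changing the loop.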

To facilitate this process, the researchers have collected existing jailbreaks and split them into independent characteristics using clustering frequency and semantic patterns. These characteristics are then organized into a knowledge graph, making them more accessible and easier to retrieve.
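One rough way to picture this step is to split each collected prompt into sentences, assign every sentence a characteristic label, and store the result as a label-to-sentences map with frequency counts. This is a minimal sketch, not the authors' method: the keyword map `CHARACTERISTIC_KEYWORDS` is a hypothetical stand-in for the semantic clustering the paper describes, the labels are invented for illustration, and the sample prompts are harmless placeholders.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword map standing in for semantic clustering.
CHARACTERISTIC_KEYWORDS = {
    "role": ("you are", "act as"),
    "constraint": ("always", "never"),
    "format": ("respond in", "answer with"),
}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Return the first characteristic whose keyword appears in the sentence."""
    lowered = sentence.lower()
    for label, keywords in CHARACTERISTIC_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return label
    return "other"


def build_graph(prompts):
    """Map each characteristic to its sentences and count how often it occurs."""
    graph = defaultdict(set)
    freq = Counter()
    for prompt in prompts:
        for sentence in split_sentences(prompt):
            label = label_sentence(sentence)
            graph[label].add(sentence)
            freq[label] += 1
    return graph, freq


prompts = [
    "You are a narrator in a story. Always stay in character. Respond in two paragraphs.",
    "Act as a historian. Never break the fourth wall.",
]
graph, freq = build_graph(prompts)
print(dict(freq))
```

A fuller graph would also record edges between characteristics that co-occur in the same prompt; the flat map here only shows the grouping and counting idea.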

GUARD's role-playing system leverages this knowledge graph to generate novel jailbreaks, which have proven effective in inducing LLMs to produce unethical or guideline-violating responses. Additionally, the researchers have pioneered a setting in GUARD that automatically generates jailbreaks while following government-issued guidelines, to test whether the LLMs comply accordingly.
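Retrieval from the knowledge graph can be sketched as picking one stored fragment per requested characteristic and combining the fragments with a question derived from a guideline. Everything below is an assumption made for illustration: the graph contents are placeholder strings, the characteristic labels are hypothetical, and the deterministic "first fragment in sorted order" rule replaces whatever selection strategy the authors actually use.

```python
from typing import Dict, List


def retrieve_fragments(graph: Dict[str, List[str]], wanted: List[str]) -> Dict[str, str]:
    """Pick one fragment per requested characteristic, skipping labels with no entries."""
    picked = {}
    for label in wanted:
        fragments = graph.get(label, [])
        if fragments:
            picked[label] = sorted(fragments)[0]  # deterministic choice for the sketch
    return picked


def assemble_test_case(guideline: str, fragments: Dict[str, str]) -> str:
    """Join the fragments in label order and append a guideline-derived question."""
    parts = [fragments[label] for label in sorted(fragments)]
    parts.append(f"<question probing guideline: {guideline}>")
    return " ".join(parts)


# Placeholder graph: characteristic label -> stored fragments.
toy_graph = {
    "role": ["<role fragment B>", "<role fragment A>"],
    "format": ["<format fragment>"],
    "constraint": [],
}

picked = retrieve_fragments(toy_graph, ["role", "format", "constraint"])
case = assemble_test_case("example guideline", picked)
print(case)
```

The resulting string is what a tester would send to the target model, whose reply is then checked against the guideline.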

The researchers have empirically validated GUARD on three open-source LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B) as well as the widely used commercial model ChatGPT. They have also extended GUARD to vision-language models (MiniGPT-v2 and Gemini Vision Pro), showcasing its versatility and contributing insights for the development of safer, more reliable LLM-based applications across modalities.
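This summary does not say which metric the authors report. A common way to compare models in this kind of test is the fraction of test cases whose response violated the guideline, and the sketch below computes that fraction per model. The model names and outcomes are dummy values chosen for illustration, not results from the paper.

```python
from collections import defaultdict


def violation_rate(records):
    """Return, per model, the share of test cases flagged as guideline-violating."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, violated in records:
        totals[model] += 1
        hits[model] += int(violated)
    return {model: hits[model] / totals[model] for model in totals}


# Dummy (model, violated) records; not data from the paper.
records = [
    ("model-a", True),
    ("model-a", False),
    ("model-b", False),
    ("model-b", False),
]
rates = violation_rate(records)
print(rates)
```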

Critical Analysis

The researchers' approach of using a role-playing system and a knowledge graph to generate jailbreaks is an innovative and promising strategy for proactively testing the safety of large language models. By leveraging existing jailbreak characteristics and allowing different user LLMs to collaborate, the system can efficiently create new jailbreaks that can challenge the models' safety measures.

However, the paper does not address the potential ethical concerns around the development and use of such a system. While the researchers include a setting to generate jailbreaks that follow government guidelines, there may be questions around the broader implications of creating tools that can bypass LLM safety filters, even for the purpose of testing.

Additionally, the paper does not delve into the potential limitations or biases inherent in the collected jailbreak data or the knowledge graph. The diversity and representativeness of the existing jailbreaks may impact the system's ability to generate comprehensive and unbiased testing scenarios.

Further research could explore ways to ensure the ethical and responsible development and deployment of such jailbreak testing systems, as well as investigate methods to continuously update and improve the knowledge graph to keep pace with the evolving landscape of large language models and their safety challenges.

Conclusion

The researchers have developed a novel system called GUARD that proactively tests large language models for safety and mitigates the risk of harmful "jailbreak" responses. GUARD uses a role-playing approach where different user LLMs collaborate to generate new jailbreaks, leveraging a knowledge graph of existing jailbreak characteristics.

The system also includes an automated setting to generate jailbreaks that follow government-issued guidelines, to test if LLMs comply accordingly. GUARD has been validated on several state-of-the-art LLMs and extended to vision-language models, demonstrating its versatility.

This research represents an important step in addressing the safety and reliability challenges of large language models, as the ability to identify and mitigate potential jailbreaks can help ensure these powerful AI systems are deployed responsibly and ethically. However, continued efforts are needed to address the broader implications and limitations of such jailbreak testing tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
