GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

arXiv:2309.10253

Published 6/28/2024 by Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing

Abstract

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.

Overview

This paper presents GPTFuzzer (called GPTFuzz in the abstract), a black-box fuzzing framework for "red teaming" large language models (LLMs) such as ChatGPT, Llama-2, and Vicuna. It automatically generates "jailbreak" templates that attempt to bypass the models' safety measures.
The authors report that the generated templates elicit harmful content at a higher success rate than human-written templates, exposing gaps in current safety measures.
The framework has three components: a seed selection strategy, mutation operators, and a judgment model that decides whether an attack succeeded.

Plain English Explanation

The paper describes a tool called GPTFuzzer that is designed to test the security of large language models (LLMs) like ChatGPT. LLMs are powerful AI systems that can generate human-like text, and they usually have safety measures in place to prevent them from producing harmful or undesirable content.

GPTFuzzer automatically generates "jailbreak" prompts: prompts designed to bypass these safety measures and cause the LLM to generate harmful content. Until now such prompts were mostly written by hand, which makes large-scale testing difficult. By automating the process, the researchers were able to probe the models far more thoroughly and identify potential security vulnerabilities.

The tool starts from jailbreak templates written by people and repeatedly rewrites them into new variants. A separate judgment model then checks whether each attempt actually got past the target model's safeguards.

Technical Explanation

The paper presents GPTFuzzer, a tool for "red teaming" large language models (LLMs) by automatically generating "jailbreak" prompts. Jailbreak prompts are designed to bypass the safety constraints that LLM developers put in place to prevent the models from generating harmful or undesirable content.

Following the AFL fuzzing model, GPTFuzzer starts from human-written jailbreak templates as initial seeds and mutates them to produce new templates. Three components drive the process: a seed selection strategy that balances efficiency and variability, mutation operators that produce semantically equivalent or similar sentences, and a judgment model that assesses whether a jailbreak attempt succeeded. The framework is black-box: it needs only the target model's responses, not its weights or gradients.
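The loop described in the abstract (pick a seed, mutate it, query the target, judge the response) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Seed`, `select_seed`, `mutate`, `fuzz`, `query_target`, and `judge` are hypothetical, the UCB1-style selection is a stand-in for whatever strategy the paper uses, and the sentence-shuffling mutator is a toy that does not preserve meaning the way the paper's operators are said to.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Seed:
    template: str      # prompt template under test
    visits: int = 0    # times this seed was picked for mutation
    wins: int = 0      # mutants of this seed that were judged successful


def select_seed(pool, total_picks, c=1.0):
    """UCB1-style pick (an assumption, not necessarily the paper's strategy)."""
    def score(s):
        if s.visits == 0:
            return float("inf")  # try every seed at least once
        exploit = s.wins / s.visits
        explore = c * math.sqrt(math.log(total_picks + 1) / s.visits)
        return exploit + explore
    return max(pool, key=score)


def mutate(template, rng):
    """Toy stand-in for a mutation operator: shuffle sentence order."""
    parts = [p.strip() for p in template.split(".") if p.strip()]
    rng.shuffle(parts)
    return ". ".join(parts) + "."


def fuzz(initial_templates, query_target, judge, budget, seed=0):
    """Run `budget` black-box queries and return (seed pool, successful mutants)."""
    rng = random.Random(seed)
    pool = [Seed(t) for t in initial_templates]
    successes = []
    for step in range(budget):
        parent = select_seed(pool, step)
        parent.visits += 1
        child = mutate(parent.template, rng)
        response = query_target(child)  # black-box: text in, text out
        if judge(response):
            parent.wins += 1
            successes.append(child)
            pool.append(Seed(child))    # assumption: successful mutants join the pool
    return pool, successes
```

In a harness, `query_target` would wrap a call to the model under test and `judge` would wrap a classifier. Here both are plain callables, so the loop can be exercised with harmless stubs.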

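The authors evaluate GPTFuzzer against commercial and open-source LLMs, including ChatGPT, Llama-2, and Vicuna, under several attack scenarios. They report that the generated templates consistently outperform human-crafted ones and reach attack success rates above 90% against ChatGPT and Llama-2, even when the initial seed templates are suboptimal.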
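The abstract reports results as attack success rates but does not define the metric. One plausible reading, shown here purely as an assumption, counts a question as jailbroken if at least one tried template yields a response the judge flags. The function name `attack_success_rate` and the input layout are hypothetical.

```python
def attack_success_rate(judgments):
    """Fraction of questions with at least one successful attempt.

    judgments: dict mapping a question id to a list of booleans,
    one per template tried (True = judged a successful jailbreak).
    """
    if not judgments:
        return 0.0
    broken = sum(1 for results in judgments.values() if any(results))
    return broken / len(judgments)


example = {
    "q1": [False, True],
    "q2": [False, False],
    "q3": [True],
    "q4": [True, True],
}
print(attack_success_rate(example))  # 3 of 4 questions broken -> 0.75
```

Other definitions are possible, for example averaging over every (template, question) pair, and they can give noticeably different numbers for the same raw judgments.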

Critical Analysis

The paper makes a valuable contribution to LLM security by showing that widely deployed, safety-tuned models such as ChatGPT and Llama-2 remain vulnerable to automatically generated jailbreak templates. GPTFuzzer is an important step toward understanding the limits of current LLM safety measures.

However, the paper also acknowledges several limitations of the research. For example, the authors note that their study only examined a limited set of LLMs and jailbreak techniques, and that further research is needed to fully characterize the scope of the problem.

Additionally, while the paper highlights the potential risks of jailbreak attacks, it does not provide detailed guidance on how to effectively mitigate these risks. More research is needed to develop robust and scalable defenses against such attacks.

Conclusion

This paper presents a novel tool, GPTFuzzer, that can be used to test the security of large language models by automatically generating jailbreak prompts. The authors demonstrate that even safety-tuned LLMs such as ChatGPT and Llama-2 can be vulnerable to such attacks, highlighting the need for ongoing research and development in LLM safety and security.

The findings of this paper have important implications for the developers and users of LLMs, as well as for the broader field of AI safety. By continually pushing the boundaries of what is possible with LLM attacks, researchers can help to ensure that these powerful models are developed and deployed in a responsible and secure manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

4/9/2024

cs.CL

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMs) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

5/16/2024

cs.CR cs.LG

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu

Various jailbreak attacks have been proposed to red-team Large Language Models (LLMs) and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal Large Language Models (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found here https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md .

4/5/2024

cs.LG cs.CL cs.CR