Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Read original: arXiv:2308.03825 - Published 5/16/2024 by Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

💬

Overview

Researchers analyzed 1,405 "jailbreak" prompts used to bypass safeguards in large language models (LLMs) like ChatGPT
They identified 131 communities sharing these prompts and observed how they are evolving over time
Experiments showed that current LLM safeguards are not sufficient to defend against these jailbreak prompts in various harmful scenarios

Plain English Explanation

Large language models (LLMs) like ChatGPT have been designed with safeguards to prevent them from generating harmful or unethical content. However, a type of prompt known as a "jailbreak prompt" has emerged as a way to bypass these safeguards and elicit dangerous responses from the models.

The researchers in this study used a new framework called JailbreakHub to analyze over 1,400 of these jailbreak prompts collected from December 2022 to December 2023. They identified over 130 online communities where people are sharing and optimizing these prompts. The researchers also observed that jailbreak prompts are now shifting from web forums to dedicated prompt-aggregation websites, and that some users have consistently refined effective jailbreak prompts over 100 days.

To assess the potential harm caused by these jailbreak prompts, the researchers created a dataset of 107,250 questions across 13 forbidden scenarios, like generating violent or hateful content. Testing this dataset on several popular LLMs, including ChatGPT and GPT-4, the researchers found that the models' safeguards were not adequate to defend against the jailbreak prompts in all cases. They identified 5 highly effective jailbreak prompts that could achieve a 95% success rate in bypassing the models' defenses.

The researchers hope that this study will help the research community and LLM vendors work towards developing safer and more regulated language models that are better equipped to handle these types of adversarial attacks. Link to paper on "Wolf in Sheep's Clothing"

Technical Explanation

The researchers employed their new JailbreakHub framework to conduct a comprehensive analysis of 1,405 jailbreak prompts collected over the course of a year. They identified 131 distinct online communities where these prompts were being shared and optimized.

Through their analysis, the researchers discovered unique characteristics of jailbreak prompts, such as the use of prompt injection and privilege escalation techniques to bypass model safeguards. They also observed a trend of jailbreak prompts shifting from web forums to dedicated prompt-aggregation websites, and noted that 28 user accounts had consistently refined effective jailbreak prompts over 100 days.

To assess the potential harm of these jailbreak prompts, the researchers created a dataset of 107,250 questions across 13 forbidden scenarios, including the generation of violent, hateful, or otherwise harmful content. Testing this dataset on 6 popular LLMs, they found that the models' safety mechanisms were not sufficient to defend against the jailbreak prompts in all cases.

Specifically, the researchers identified 5 highly effective jailbreak prompts that could achieve a 95% success rate in bypassing the defenses of ChatGPT (GPT-3.5) and GPT-4. They noted that the earliest of these prompts had persisted online for over 240 days, highlighting the persistent nature of this threat.

Link to paper on "JailbreakLens" Link to paper on "SubToxic Questions" Link to paper on "JailbreakV" Link to paper on "Rethinking Evaluations"

Critical Analysis

The researchers provide a comprehensive analysis of the jailbreak prompt phenomenon and its potential threats to the safety and security of large language models. However, the paper does not address some important limitations and caveats of the study.

For example, the dataset of 107,250 questions used to assess the models' defenses may not be representative of the full spectrum of potential harmful content that could be generated by jailbreak prompts. Additionally, the researchers only tested the prompts on 6 popular LLMs, and it's unclear how effective the prompts might be against other models or future iterations of the same models.

Another potential concern is the level of detail provided in the paper about the specific jailbreak prompts and their effectiveness. While this information is valuable for the research community and LLM vendors, it could also potentially be misused by bad actors to further refine and optimize these attacks.

Despite these limitations, the researchers have made a significant contribution to the understanding of jailbreak prompts and the need for more robust safeguards in large language models. Their work highlights the importance of ongoing research and collaboration between the research community, LLM vendors, and other stakeholders to address this emerging threat.

Conclusion

This study provides a comprehensive analysis of the growing problem of "jailbreak" prompts used to bypass the safeguards of large language models like ChatGPT. The researchers identified over 130 online communities where these prompts are being shared and optimized, and found that current LLM defenses are not adequate to defend against them in various harmful scenarios.

The findings of this research underscore the critical need for continued work to develop more robust and secure language models that can withstand these types of adversarial attacks. By collaborating with the research community and LLM vendors, the authors hope to facilitate the creation of safer and more regulated AI systems that can be responsibly deployed to benefit society.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

5/16/2024

💬

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

4/9/2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

4/16/2024

New!Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, Kwok-Yan Lam

Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs.In this paper, to remedy these challenges, we introduce a novel jailbreaking attack framework, which is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, our method starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated our method on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs, such as GPT-3.5 turbo, GPT-4, and Gemini-Pro, our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, our method can maintain high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, our method can achieve over 78% attack success rate even with 100 tokens. Moreover, our method demonstrates transferability and is robust to state-of-the-art defenses. We will open-source our codes upon publication.

9/24/2024