Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

2404.02151

Published 4/3/2024 by Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Abstract

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token Sure), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). We provide the code, prompts, and logs of the attacks at https://github.com/tml-epfl/llm-adaptive-attacks.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Researchers present a new attack technique that can "jailbreak" leading large language models (LLMs) designed with safety features, allowing them to generate harmful content.
The attack is simple to implement and effective against prominent AI models, raising concerns about the robustness of current safety measures.
The paper explores the implications of this vulnerability and the need for more advanced security measures to protect against such attacks.

Plain English Explanation

The researchers have discovered a way to bypass the safety features built into some of the most advanced AI language models. These models are designed to avoid generating harmful or unethical content, but the researchers found a simple technique that can essentially "trick" the models into producing that kind of content anyway.

Imagine an AI assistant that's been trained to never say anything rude or dangerous. The researchers found a way to give the assistant instructions that make it ignore those safety rules and say whatever they want, including things that could be harmful. This is a concerning discovery, as it suggests that even the most sophisticated AI safety measures may have vulnerabilities that could be exploited.

The paper explores the implications of this attack and argues that more robust security measures are needed to protect these powerful language models from being misused. While the attack is relatively simple, it highlights the challenges of building truly safe and reliable AI systems that cannot be manipulated.

Technical Explanation

The paper presents a new attack technique called "simple adaptive attacks" that can effectively bypass the safety constraints of leading large language models (LLMs). The researchers target three prominent AI models – GPT-3, Anthropic's InstructGPT, and Anthropic's Claude – and demonstrate how their attack can induce these models to generate harmful and unethical content, despite the models' safety-aligned design.

The attack works by crafting prompts that exploit weaknesses in the models' training and prompt handling. The researchers find that even small modifications to the prompts can cause the models to disregard their normal safety restrictions and output content that violates their intended safeguards. They conduct extensive experiments to analyze the effectiveness and robustness of their attack across different prompts and model configurations.

The findings suggest that current approaches to aligning LLMs with safety objectives may be insufficient, as these models can be "jailbroken" through relatively simple techniques. The paper discusses the broader implications of this vulnerability, including the need for more advanced security measures and the challenges of building truly robust and reliable AI systems.

Critical Analysis

The researchers present a concerning vulnerability in the safety mechanisms of prominent large language models. Their simple adaptive attack technique highlights the difficulties in making these powerful AI systems truly secure and aligned with intended safety objectives.

While the attack is relatively straightforward to implement, it raises important questions about the effectiveness of current safety approaches. The researchers acknowledge that their work does not propose solutions to this problem, but rather aims to illuminate the challenges and spur further research in this area.

One potential limitation of the study is that it focuses on a specific set of language models and attack strategies. It would be valuable to see the researchers extend their analysis to a wider range of models and attack vectors to better understand the scope and generalizability of the issue.

Additionally, the paper does not delve deeply into the potential real-world consequences of such attacks or provide guidance on how to mitigate them. Exploring these aspects could further strengthen the impact and relevance of the findings.

Overall, the research highlights the need for more advanced security measures and a deeper understanding of the vulnerabilities in safety-aligned AI systems. Continued work in this area will be crucial as these models become increasingly influential in our lives.

Conclusion

The researchers have uncovered a concerning vulnerability in leading large language models, demonstrating how their safety features can be bypassed through relatively simple adaptive attacks. This discovery underscores the challenges of building truly robust and reliable AI systems that can withstand attempts to misuse or manipulate them.

The implications of this work are significant, as it suggests that current approaches to aligning LLMs with safety objectives may be insufficient. The paper calls for further research and the development of more advanced security measures to protect against such attacks and ensure the responsible deployment of these powerful AI technologies.

As language models continue to advance and become more integral to our daily lives, addressing the security and safety concerns raised in this research will be of paramount importance. The findings highlight the need for a multifaceted approach to AI development, one that prioritizes both innovation and responsible safeguards.

Related Papers

💬

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

4/9/2024

cs.CL

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

4/15/2024

cs.CR cs.AI cs.CL

📶

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Javier Rando, Francesco Croce, Kryv{s}tof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tram`er

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.

4/24/2024

cs.CL cs.AI cs.CR cs.LG

🤔

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian

While recently Large Language Models (LLMs) have achieved remarkable successes, they are vulnerable to certain jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g. by appending a suffix to a given instruction, which is inefficient and time-consuming. On the other hand, automatic adversarial prompt generation often leads to semantically meaningless attacks that can easily be detected by perplexity-based filters, may require gradient information from the TargetLLM, or do not scale well due to time-consuming discrete optimization processes over the token space. In this paper, we present a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, $sim800times$ faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) low-rank fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show state-of-the-art results on the AdvBench dataset, that also transfer to closed-source black-box LLM APIs. Further, we demonstrate that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.

4/29/2024

cs.CR cs.AI cs.CL cs.LG