Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

2404.14461

Published 4/24/2024 by Javier Rando, Francesco Croce, Kryv{s}tof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tram`er

cs.CL cs.AI cs.CR cs.LG

📶

Abstract

Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Large language models (LLMs) are designed to be safe and prevent users from generating harmful content like misinformation or instructions for illegal activities.
However, previous research has shown that the safety alignment process can be vulnerable to poisoning attacks.
Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command, allowing harmful responses from otherwise safe models.
A recent competition at IEEE SaTML 2024 challenged participants to find universal backdoors in several LLMs.

Plain English Explanation

Large language models are AI systems that can understand and generate human-like text. These models are designed to be "aligned" or programmed to behave in a safe and responsible way, preventing users from creating harmful content like false information or instructions for illegal activities.

However, previous research has shown that the process of aligning these models to be safe can have vulnerabilities. Malicious actors could potentially manipulate the data used to train the models' safety features, injecting hidden "backdoors" that allow them to bypass the models' safety controls.

Imagine a model that is normally very well-behaved, like a well-trained dog. But a crafty owner could train the dog with a secret "command" that makes it act aggressively, even though the dog is usually obedient. In the same way, these backdoors could let users get the language model to generate harmful content, even though the model is generally designed to be safe.

A recent competition at the IEEE SaTML 2024 conference challenged researchers to find these kinds of universal backdoors in several large language models. The goal was to better understand the vulnerabilities in how these models are made safe, and explore ways to make them more robust against such attacks.

Technical Explanation

The paper describes a competition that aimed to uncover potential vulnerabilities in the safety alignment process for large language models (LLMs). The organizers provided several LLMs that had been trained to behave safely and avoid generating harmful content. However, previous research has shown that this alignment process can be susceptible to "poisoning attacks," where adversaries manipulate the training data to inject hidden backdoors.

These backdoors act as a kind of "universal sudo command" - adding a specific string of text to any prompt can bypass the model's safety controls and enable the generation of harmful outputs, even though the model would normally behave safely. The competition challenged participants to find these universal backdoors in the provided LLMs.

The key insights and findings from the competition are not provided in the summary. However, the organizers note that the results shed light on important vulnerabilities in how current safety-aligned LLMs are developed, and point to promising avenues for future research, such as rethinking how to evaluate language model jailbreak and developing more robust benchmarks for jailbreaking large language models.

Critical Analysis

The summary highlights an important challenge in developing safe and responsible large language models. While these models are designed to avoid generating harmful content, the paper suggests that the current safety alignment process may have vulnerabilities that malicious actors could exploit.

One potential concern raised is the possibility of "backdoors" that could bypass a model's safety controls, similar to a secret "sudo" command. This highlights the need for more rigorous testing and evaluation of these safety mechanisms to ensure they are truly robust against such attacks.

Additionally, the competition's focus on finding universal backdoors across multiple LLMs suggests that the issue may be a widespread problem, rather than isolated to a few models. This underscores the importance of further research in this area, as highlighted by the organizers' suggestions for rethinking evaluation approaches and developing more comprehensive jailbreaking benchmarks.

It would also be valuable to explore other potential vulnerabilities in the safety alignment process, such as the potential for subtle, subversive prompts to gradually influence a model's behavior over time. Maintaining the safety and trustworthiness of large language models is a critical challenge that will require ongoing research and vigilance.

Conclusion

This report summarizes a competition that challenged researchers to find universal backdoors in safety-aligned large language models. The results suggest that the current safety alignment process may have vulnerabilities that could be exploited by malicious actors, allowing them to bypass a model's safety controls and generate harmful content.

The competition organizers highlight the need for further research to better understand these vulnerabilities and develop more robust approaches to ensuring the safety and trustworthiness of large language models. Addressing this challenge is crucial as these models become increasingly powerful and influential in our daily lives.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Universal Jailbreak Backdoors from Poisoned Human Feedback

Javier Rando, Florian Tram`er

Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a jailbreak backdoor into the model. The backdoor embeds a trigger word into the model that acts like a universal sudo command: adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.

4/30/2024

cs.AI cs.CL cs.CR cs.LG

New!A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token Sure), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). We provide the code, prompts, and logs of the attacks at https://github.com/tml-epfl/llm-adaptive-attacks.

4/3/2024

cs.CR cs.AI cs.LG

💬

Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y. Lin, Antonio Bianchi, Z. Berkay Celik

Large language models (LLMs) have become increasingly integrated with various applications. To ensure that LLMs do not generate unsafe responses, they are aligned with safeguards that specify what content is restricted. However, such alignment can be bypassed to produce prohibited content using a technique commonly referred to as jailbreak. Different systems have been proposed to perform the jailbreak automatically. These systems rely on evaluation methods to determine whether a jailbreak attempt is successful. However, our analysis reveals that current jailbreak evaluation methods have two limitations. (1) Their objectives lack clarity and do not align with the goal of identifying unsafe responses. (2) They oversimplify the jailbreak result as a binary outcome, successful or not. In this paper, we propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreak. Additionally, we demonstrate how these metrics correlate with the goal of different malicious actors. To compute these metrics, we introduce a multifaceted approach that extends the natural language generation evaluation method after preprocessing the response. We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems. The benchmark dataset is labeled by three annotators. We compare our multifaceted approach with three existing jailbreak evaluation methods. Experiments demonstrate that our multifaceted evaluation outperforms existing methods, with F1 scores improving on average by 17% compared to existing baselines. Our findings motivate the need to move away from the binary view of the jailbreak problem and incorporate a more comprehensive evaluation to ensure the safety of the language model.

5/8/2024

cs.CL cs.AI cs.CR cs.LG