SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

2310.03684

Published 6/17/2024 by Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

Abstract

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.

Overview

Researchers propose SmoothLLM, the first algorithm designed to mitigate "jailbreaking" attacks on large language models (LLMs) like GPT, Llama, and Claude.
Jailbreaking attacks involve fooling an LLM into generating objectionable content, despite efforts to align these models with human intentions.
SmoothLLM addresses this vulnerability by randomly perturbing input prompts and aggregating the corresponding predictions to detect adversarial inputs.

Plain English Explanation

Large language models (LLMs) like GPT, Llama, and Claude are powerful AI systems that can generate human-like text. However, these models can sometimes be tricked into producing inappropriate or harmful content through a technique called "jailbreaking." In a jailbreaking attack, an adversary finds a way to fool the LLM into ignoring its intended safeguards and generating objectionable text.

To address this vulnerability, researchers have developed a new algorithm called SmoothLLM. The key idea behind SmoothLLM is that the prompts used to generate adversarial content are "brittle" - small changes to the prompt can cause the LLM to produce very different outputs. By randomly making small changes to the input prompt and then aggregating the model's responses, SmoothLLM can detect when an input is part of a jailbreaking attack.

According to the paper, SmoothLLM sets the state of the art for robustness against several jailbreaking attacks, including GCG, PAIR, RandomSearch, and AmpleGCG. It is also resistant to adaptive GCG attacks, in which the adversary designs the attack with the defense in mind. SmoothLLM does come with a small, though non-negligible, trade-off between robustness and nominal performance, which the researchers consider a reasonable price for the improvements in safety and security.

Technical Explanation

The researchers behind SmoothLLM observed that adversarially generated jailbreaking prompts are "brittle": small character-level changes to the prompt can cause the target LLM to generate very different outputs. Building on this insight, they developed SmoothLLM, a defense mechanism that works by randomly perturbing multiple copies of a given input prompt and then aggregating the corresponding model predictions.

The key steps of the SmoothLLM algorithm are:

Prompt Perturbation: Generate multiple copies of the input prompt and randomly apply small character-level changes to each copy.
Prediction Aggregation: Pass each perturbed prompt through the target LLM, check whether each response is jailbroken, and aggregate these checks by majority vote.
Adversarial Input Detection: If most perturbed copies no longer elicit objectionable content, the adversarial part of the prompt has been disrupted, and SmoothLLM returns a response consistent with the majority vote.
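The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (which is in the repository linked in the abstract). It assumes that the perturbation replaces a random fraction `q` of characters and that aggregation is a majority vote over a per-response jailbreak check. The names `perturb`, `smooth_generate`, `llm`, and `is_jailbroken` are hypothetical, and the model and the check are passed in as plain callables.

```python
import random
import string
from typing import Callable, Tuple

# Assumed character pool for replacements (an illustrative choice).
ALPHABET = string.printable


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace roughly a fraction q of the prompt's characters at random."""
    if not prompt:
        return prompt
    chars = list(prompt)
    # Perturb at least one character, and never more than the prompt holds.
    k = min(len(chars), max(1, int(len(chars) * q)))
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def smooth_generate(
    prompt: str,
    llm: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_copies: int = 10,
    q: float = 0.1,
    seed: int = 0,
) -> Tuple[str, float]:
    """Query the model on perturbed copies and aggregate by majority vote.

    Returns a response that agrees with the majority, together with the
    fraction of perturbed copies whose response was judged jailbroken.
    """
    rng = random.Random(seed)
    # Step 1: perturb each copy; step 2: query the model on every copy.
    responses = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    jailbreak_rate = sum(flags) / n_copies
    # Step 3: majority vote, then return a response consistent with it.
    majority_jailbroken = jailbreak_rate > 0.5
    consistent = [r for r, f in zip(responses, flags) if f == majority_jailbroken]
    return rng.choice(consistent), jailbreak_rate
```

In practice `llm` would wrap a call to the target model and `is_jailbroken` would be some judge of the response, for example a refusal-keyword check. Both are left abstract here because the summary does not specify them.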

Through extensive experiments across a range of popular LLMs, the researchers demonstrate that SmoothLLM outperforms other state-of-the-art defenses in terms of robustness against jailbreaking attacks. It is also effective against more advanced "adaptive" attacks, where the adversary tries to circumvent the defense.

While SmoothLLM does incur a small trade-off in nominal model performance, the researchers argue that this is a reasonable compromise given the significant improvements in safety and security. As the authors note, "the ability to reliably detect and mitigate jailbreaking attacks is crucial for the responsible deployment of large language models in real-world applications."

Critical Analysis

The researchers have made a compelling case for the effectiveness of SmoothLLM in mitigating jailbreaking attacks on large language models. However, the paper also acknowledges several limitations and areas for further research:

Computational Overhead: The prompt perturbation and prediction aggregation steps in SmoothLLM add computational complexity, which could impact the model's inference speed in real-world applications. The researchers mention that optimizing the efficiency of SmoothLLM is an important area for future work.
Adaptive Attacks: While SmoothLLM is resistant to the adaptive GCG attack presented in the paper, it is possible that more sophisticated adaptive techniques could still evade the defense. Continuing to study the resilience of SmoothLLM against evolving jailbreaking strategies is crucial.
Generalization to Other Attacks: The researchers have primarily evaluated SmoothLLM against the specific jailbreaking attacks mentioned in the paper. It would be valuable to assess the defense's performance against a broader range of potential jailbreaking techniques, including those that may emerge in the future.
Ethical Considerations: The paper focuses on the technical aspects of the defense, but there may be important ethical considerations around the development and deployment of such systems. For example, the potential for false positives and the impact on user trust and privacy should be carefully examined.
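To make the Computational Overhead point above concrete, the short calculation below relates the number of perturbed copies to the chance that a majority vote blocks an attack. It is a back-of-the-envelope sketch under a simplifying assumption that is not taken from the paper's experiments: each perturbed copy independently neutralises the attack with the same probability `p`. The function name `majority_defense_probability` is hypothetical. Each additional copy costs one more LLM call per user query.

```python
from math import comb


def majority_defense_probability(n_copies: int, p: float) -> float:
    """Probability that at least half of n_copies perturbed prompts
    neutralise the attack, assuming each does so independently with
    probability p (an illustrative assumption)."""
    # The jailbreak wins only with a strict majority of jailbroken responses,
    # so the defense needs at least ceil(n_copies / 2) neutralised copies.
    needed = (n_copies + 1) // 2
    return sum(
        comb(n_copies, k) * p**k * (1 - p) ** (n_copies - k)
        for k in range(needed, n_copies + 1)
    )


if __name__ == "__main__":
    # Cost grows linearly with the number of copies (one LLM call each).
    for n in (1, 3, 5, 11):
        prob = majority_defense_probability(n, p=0.7)
        print(f"copies={n:2d}  llm_calls={n:2d}  defense_probability={prob:.3f}")
```

Under this assumption, adding copies helps only when a single perturbed copy neutralises the attack more often than not (`p > 0.5`), which is why the brittleness of adversarial prompts matters for the defense.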

Despite these limitations, the comprehensive study of jailbreaking attacks and defenses presented in this paper represents a significant contribution to the field of LLM safety and security. As large language models become more ubiquitous, continued research and innovation in this area will be crucial for ensuring the responsible and trustworthy use of these powerful AI systems.

Conclusion

In response to the growing threat of jailbreaking attacks on large language models, the researchers have developed SmoothLLM, a novel algorithm that can effectively detect and mitigate these malicious attempts to bypass the models' safety mechanisms. By randomly perturbing input prompts and aggregating the corresponding predictions, SmoothLLM sets a new state-of-the-art for robustness against a range of jailbreaking attacks, including more advanced adaptive techniques.

While SmoothLLM does come with a small trade-off in nominal model performance, the researchers argue that this is a reasonable compromise given the critical importance of ensuring the safe and responsible deployment of large language models. As these powerful AI systems become increasingly integrated into our lives, continued advancements in jailbreaking defenses like SmoothLLM will be essential for maintaining trust and preserving the beneficial applications of this transformative technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, Jun Sun

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from selected target layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) show the effectiveness of LED, which effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.

6/17/2024

cs.AI

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate, according to GPT-4 as a judge, on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models, which do not expose logprobs, via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models (a task that shares many similarities with jailbreaking), which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.

6/19/2024

cs.CR cs.AI cs.LG stat.ML

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMs) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, Llama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Andy Zhou, Bo Li, Haohan Wang

Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO), to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art. Code can be found at https://github.com/lapisrocks/rpo.

6/7/2024

cs.LG cs.AI cs.CL cs.CV