Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Read original: arXiv:2401.17263 - Published 7/10/2024 by Andy Zhou, Bo Li, Haohan Wang

🛠️

Overview

Advances in AI alignment have not fully addressed the vulnerability of large language models (LLMs) to adversarial attacks or "jailbreaking" - where prompts are modified to induce unwanted behavior.
While some defenses have been proposed, they have not kept up with the evolving threat landscape.
The paper introduces an optimization-based approach called Robust Prompt Optimization (RPO) to create system-level defenses against jailbreaking attacks on LLMs.

Plain English Explanation

Large language models (LLMs) like GPT-4 and Llama-2 are powerful AI systems that can generate human-like text. However, these models remain vulnerable to adversarial attacks or "jailbreaking" - where adversaries craft prompts that can make the model behave in unintended ways, such as generating harmful content.

While researchers have proposed some defenses against these attacks, the authors argue that these defenses have not kept pace with the evolving threat landscape. To address this, the researchers developed an approach called Robust Prompt Optimization (RPO) that aims to make LLMs more resistant to jailbreaking attacks.

The key idea behind RPO is to directly incorporate the adversary's objective into the defensive objective. This allows the defense to adapt to the worst-case attacks, rather than just the specific attacks seen during training. The authors show that this approach can significantly reduce the success rate of jailbreaking attacks on LLMs like GPT-4 and Llama-2.

Technical Explanation

The paper proposes an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO), to implement this defense.

The core of the RPO approach is to directly incorporate the adversary's objective into the defensive objective. This means the defense is optimized to work against the worst-case adaptive attacks, rather than just the specific attacks seen during training.

Specifically, the authors formulate the defense as a minimax optimization problem, where the defender tries to find the optimal "robust suffix" (a lightweight and transferable prompt) to append to the input, while the adversary tries to find the prompt that maximizes the likelihood of the unwanted behavior. By solving this minimax problem, the RPO algorithm can create a defense that is resilient to a broad range of jailbreaking attacks.

The authors demonstrate the effectiveness of RPO through theoretical analysis and extensive experiments on GPT-4 and Llama-2 using the JailbreakBench benchmark. They show that RPO can significantly reduce the attack success rate compared to prior defenses, reaching 6% on GPT-4 and 0% on Llama-2, setting a new state-of-the-art.

Critical Analysis

The paper presents a promising approach to defending LLMs against jailbreaking attacks, but it also acknowledges several caveats and limitations:

The defense is focused on system-level robustness and may not address all potential vulnerabilities, such as those arising from the model's multimodal nature.
The authors note that the defense may not be as effective against extremely powerful adversaries or in scenarios with very limited computational resources.
The paper does not explore the broader societal implications of these types of defenses, such as the potential for misuse or the impact on model transparency and interpretability.

Furthermore, while the results are impressive, it would be valuable to see the defense evaluated on a wider range of LLMs and attack scenarios to better understand its generalizability and limitations.

Conclusion

The paper presents a novel optimization-based approach called Robust Prompt Optimization (RPO) that aims to make large language models more resistant to jailbreaking attacks. By directly incorporating the adversary's objective into the defensive objective, RPO can adapt to a broad range of adaptive attacks, outperforming previous defenses.

The authors' experimental results on GPT-4 and Llama-2 demonstrate the potential of this approach, but also highlight the need for continued research to address the evolving threat landscape and the broader societal implications of these types of defenses. As the capabilities of LLMs continue to grow, developing robust and responsible defensive strategies will be crucial to ensuring the safe and beneficial deployment of these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛠️

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks

Andy Zhou, Bo Li, Haohan Wang

Despite advances in AI alignment, large language models (LLMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries can modify prompts to induce unwanted behavior. While some defenses have been proposed, they have not been adapted to newly proposed attacks and more challenging threat models. To address this, we propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm, Robust Prompt Optimization (RPO) to create robust system-level defenses. Our approach directly incorporates the adversary into the defensive objective and optimizes a lightweight and transferable suffix, enabling RPO to adapt to worst-case adaptive attacks. Our theoretical and experimental results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench, setting the state-of-the-art. Code can be found at https://github.com/lapisrocks/rpo

7/10/2024

💬

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at url{https://github.com/arobey1/smooth-llm}.

6/17/2024

New!Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs

Xueluan Gong, Mingzhe Li, Yilin Zhang, Fengyuan Ran, Chen Chen, Yanjiao Chen, Qian Wang, Kwok-Yan Lam

Large Language Models (LLMs) have excelled in various tasks but are still vulnerable to jailbreaking attacks, where attackers create jailbreak prompts to mislead the model to produce harmful or offensive content. Current jailbreak methods either rely heavily on manually crafted templates, which pose challenges in scalability and adaptability, or struggle to generate semantically coherent prompts, making them easy to detect. Additionally, most existing approaches involve lengthy prompts, leading to higher query costs.In this paper, to remedy these challenges, we introduce a novel jailbreaking attack framework, which is an automated, black-box jailbreaking attack framework that adapts the black-box fuzz testing approach with a series of customized designs. Instead of relying on manually crafted templates, our method starts with an empty seed pool, removing the need to search for any related jailbreaking templates. We also develop three novel question-dependent mutation strategies using an LLM helper to generate prompts that maintain semantic coherence while significantly reducing their length. Additionally, we implement a two-level judge module to accurately detect genuine successful jailbreaks. We evaluated our method on 7 representative LLMs and compared it with 5 state-of-the-art jailbreaking attack strategies. For proprietary LLM APIs, such as GPT-3.5 turbo, GPT-4, and Gemini-Pro, our method achieves attack success rates of over 90%, 80%, and 74%, respectively, exceeding existing baselines by more than 60%. Additionally, our method can maintain high semantic coherence while significantly reducing the length of jailbreak prompts. When targeting GPT-4, our method can achieve over 78% attack success rate even with 100 tokens. Moreover, our method demonstrates transferability and is robust to state-of-the-art defenses. We will open-source our codes upon publication.

9/24/2024

🔄

Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang

While Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to jailbreak attacks. Several primary defense strategies have been proposed to protect LLMs from producing harmful information, mostly with a particular focus on harmful content filtering or heuristical defensive prompt designs. However, how to achieve intrinsic robustness through the prompts remains an open problem. In this paper, motivated by adversarial training paradigms for achieving reliable robustness, we propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix. To achieve our defense goal whilst maintaining natural performance, we optimize the control prompt with both adversarial and benign prompts. Comprehensive experiments show that our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0 while maintaining the model's utility on the benign task. The proposed defense strategy incurs only negligible computational overhead, charting a new perspective for future explorations in LLM security. Our code is available at https://github.com/rain152/PAT.

8/23/2024