Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation

2405.13068

Published 6/21/2024 by Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang

📊

Abstract

Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated mining process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful language models.

Create account to get full access

Overview

Large language models (LLMs) have transformed natural language processing, but they remain vulnerable to jailbreaking attacks
Existing token-level jailbreaking techniques face scalability and efficiency challenges as models are updated and incorporate new defenses
This paper introduces JailMine, an innovative token-level manipulation approach to address these limitations

Plain English Explanation

Large language models (LLMs) are powerful artificial intelligence systems that can understand and generate human-like text. While LLMs have revolutionized fields like natural language processing, they are also vulnerable to a type of attack called "jailbreaking."

Jailbreaking is when someone finds a way to make an LLM produce unintended and potentially harmful content, even when the system is designed to avoid that. Current jailbreaking techniques work by carefully modifying the text inputs to the LLM. However, these techniques can be time-consuming and may not work as well when the LLM is updated with new defenses.

The researchers in this paper introduce a new approach called "JailMine" that addresses these limitations. JailMine uses an automated process to find ways to elicit malicious responses from LLMs, even as the models evolve to prevent such attacks. The researchers show that JailMine is both effective and efficient, reducing the time needed to jailbreak an LLM by 86% on average while maintaining a high success rate of 95%.

This work highlights the ongoing challenge of ensuring the security and reliability of powerful language models, and the importance of continued research to mitigate the risks of jailbreaking attacks.

Technical Explanation

The paper introduces JailMine, a novel token-level manipulation approach to jailbreaking large language models (LLMs). Existing jailbreaking techniques, while effective, face scalability and efficiency challenges as models undergo frequent updates and incorporate advanced defensive measures.

JailMine employs an automated mining process to strategically select affirmative outputs from the LLM and iteratively reduce the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, the researchers demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies.

The paper makes several key contributions:

Automated Mining Process: JailMine introduces an innovative automated mining process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection.
Improved Efficiency and Effectiveness: The researchers show that JailMine significantly outperforms existing token-level jailbreaking techniques, reducing the time needed by 86% on average while maintaining a high success rate of 95%.
Robustness to Evolving Defenses: JailMine demonstrates its ability to effectively jailbreak LLMs even as the models incorporate advanced defensive measures, highlighting the system's adaptability and resilience.

The paper's findings contribute to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful language models.

Critical Analysis

The paper presents a compelling solution to the challenge of jailbreaking large language models (LLMs) in the face of evolving defensive strategies. The researchers' introduction of the JailMine system, with its automated mining process and improved efficiency, is a significant advancement in the field.

However, the paper does acknowledge some limitations and areas for further research. For example, the researchers note that JailMine's effectiveness may be influenced by the specific LLM and dataset used, and that further testing across a wider range of models and scenarios would be valuable.

Additionally, while the paper demonstrates JailMine's ability to jailbreak LLMs that have incorporated advanced defensive measures, it does not provide a comprehensive analysis of the long-term viability of the approach. As language models continue to evolve, it will be critical to assess whether JailMine can maintain its effectiveness in the face of increasingly sophisticated defensive strategies.

Overall, the paper makes a significant contribution to the field of LLM security, but further research and ongoing vigilance will be necessary to ensure the continued reliability and safety of these powerful language models.

Conclusion

This paper introduces JailMine, an innovative token-level manipulation approach that addresses the limitations of existing jailbreaking techniques for large language models (LLMs). Through rigorous testing, the researchers demonstrate JailMine's effectiveness and efficiency, achieving a significant reduction in time consumed while maintaining high success rates, even as LLMs incorporate advanced defensive measures.

The work highlights the ongoing challenge of ensuring the security and reliability of powerful language models, and the importance of continued research to mitigate the risks of jailbreaking attacks. As LLMs continue to evolve and play an increasingly influential role in our lives, it will be crucial to develop robust and adaptive safeguards to protect against the potential misuse of these transformative technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek

Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of jailbreaking, where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.

5/20/2024

cs.CR cs.AI

🌀

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Zhao Xu, Fan Liu, Hao Liu

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we evaluate the impact of various attack settings on LLM performance and provide a baseline benchmark for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 320 experiments with about 50,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/Bag_of_Tricks_for_LLM_Jailbreaking.

6/14/2024

cs.CR cs.AI cs.CL

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

4/15/2024

cs.CR cs.AI cs.CL

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token ``Sure''), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.

6/19/2024

cs.CR cs.AI cs.LG stat.ML