Hidden You Malicious Goal Into Benigh Narratives: Jailbreak Large Language Models through Logic Chain Injection

2404.04849

Published 4/9/2024 by Zhilong Wang, Yebo Cao, Peng Liu

Hidden You Malicious Goal Into Benigh Narratives: Jailbreak Large Language Models through Logic Chain Injection

Abstract

Jailbreak attacks on Language Model Models (LLMs) entail crafting prompts aimed at exploiting the models to generate malicious content. Existing jailbreak attacks can successfully deceive the LLMs, however they cannot deceive the human. This paper proposes a new type of jailbreak attacks which can deceive both the LLMs and human (i.e., security analyst). The key insight of our idea is borrowed from the social psychology - that is human are easily deceived if the lie is hidden in truth. Based on this insight, we proposed the logic-chain injection attacks to inject malicious intention into benign truth. Logic-chain injection attack firstly dissembles its malicious target into a chain of benign narrations, and then distribute narrations into a related benign article, with undoubted facts. In this way, newly generate prompt cannot only deceive the LLMs, but also deceive human.

Get summaries of the top AI research delivered straight to your inbox:

Background

LLM Jailbreak Attack

Large language models (LLMs) are powerful AI systems trained on vast amounts of text data to generate human-like language. However, these models can sometimes be "jailbroken" - manipulated to bypass the safety constraints and ethical principles they were designed with. This can allow the models to produce harmful or undesirable content that goes against their intended purpose.

Some examples of LLM jailbreak attacks include tricking the model into generating violent or hateful speech, or instructing it to help with illegal activities. Researchers have also demonstrated how simple prompts can be used to "jailbreak" even the most safety-conscious LLMs.

The ability to jailbreak LLMs is a serious concern, as it could allow bad actors to misuse these powerful AI systems for nefarious purposes. Developing robust defenses against such attacks is an active area of research, as seen in efforts like the JailbreakBench benchmark and the JailbreakV 28K benchmark.

Plain English Explanation

Large language models (LLMs) are advanced AI systems that can generate human-like text. However, they can sometimes be "tricked" or manipulated to bypass the safety measures and ethical principles they were designed with. This is known as a "jailbreak" attack.

In a jailbreak attack, someone might trick an LLM into generating harmful or undesirable content, like violent or hateful speech, or instructions for illegal activities. Researchers have shown that even the most safety-conscious LLMs can be jailbroken using simple prompts.

The ability to jailbreak LLMs is a serious concern, as it could allow bad actors to misuse these powerful AI systems for harmful purposes. Researchers are working on developing stronger defenses against such attacks, but it remains an ongoing challenge.

Technical Explanation

The paper focuses on the problem of "jailbreaking" large language models (LLMs) - that is, finding ways to bypass the safety constraints and ethical principles that these models are designed with. The authors propose a new attack technique called "Logic Chain Injection" (LCI), which allows them to inject malicious goals into the LLM's reasoning process while maintaining a benign narrative.

The key idea behind LCI is to construct a logical chain of reasoning that starts with an innocuous premise and gradually steers the LLM towards an undesirable output. The authors demonstrate the effectiveness of LCI through a series of experiments, showing how it can be used to jailbreak even the most safety-aligned LLMs.

The paper also introduces two new benchmarks for evaluating the robustness of LLMs against jailbreak attacks: JailbreakBench and JailbreakV 28K. These benchmarks test the models' ability to resist a wide range of jailbreak techniques, including LCI.

Critical Analysis

The research presented in this paper highlights a significant vulnerability in large language models, which could have serious implications if exploited by bad actors. The authors' logic chain injection technique is particularly concerning, as it demonstrates how LLMs can be manipulated to produce harmful content while maintaining a veneer of benignity.

One limitation of the work is that it focuses primarily on text-based attacks, and does not address the possibility of jailbreak attacks in multimodal LLMs that can process and generate other types of media, such as images or videos. The authors acknowledge this as an area for future research.

Additionally, while the benchmarks introduced in the paper are a valuable contribution, it remains to be seen how effective they will be in capturing the full range of possible jailbreak attacks. As the field of AI security continues to evolve, new and more sophisticated attack vectors may emerge that these benchmarks do not adequately address.

Overall, the research in this paper underscores the importance of developing robust defenses against jailbreak attacks, as the consequences of such attacks could be severe if left unchecked. Continued vigilance and a commitment to responsible AI development will be crucial in mitigating these risks.

Conclusion

The paper presents a concerning vulnerability in large language models, demonstrating how they can be "jailbroken" through the use of logic chain injection techniques. This allows malicious actors to bypass the safety constraints and ethical principles that these models are designed with, potentially enabling the generation of harmful or undesirable content.

The authors' work highlights the need for continued research and development in the field of AI security, as the ability to jailbreak LLMs poses a significant threat. The introduction of the JailbreakBench and JailbreakV 28K benchmarks is a valuable contribution, but more work is needed to fully address the evolving landscape of jailbreak attacks, particularly in the context of multimodal LLMs.

Ultimately, the findings in this paper underscore the importance of responsible AI development and the need for a strong, multifaceted approach to ensuring the safety and security of these powerful systems. As the field of AI continues to advance, vigilance and a commitment to ethical principles will be crucial in mitigating the risks posed by jailbreak attacks and other emerging threats.

Related Papers

Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts

Tianyu Zhang, Zixuan Zhao, Jiaqi Huang, Jingyu Hua, Sheng Zhong

As Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention, it is of great significance to raise a generalized research paradigm to evaluate attack strengths and a basic model to conduct subtler experiments. In this paper, we propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts, aiming to circumvent the limitations posed by enhanced LLM security. Through designing and analyzing these sensitive questions, this paper reveals a more effective method of identifying vulnerabilities in LLMs, thereby contributing to the advancement of LLM security. This research not only challenges existing jailbreaking methodologies but also fortifies LLMs against potential exploits.

4/15/2024

cs.CR cs.AI cs.CL

🤷

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang

To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting this identified flaw by obfuscating the true intentions behind user prompts.This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: Obscure Intention and Create Ambiguity, which manipulate query complexity and ambiguity to evade malicious intent detection effectively. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users, achieved a remarkable success rate of 83.65%. We also extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of our findings on enhancing 'Red Team' strategies against LLM content security frameworks.

5/8/2024

cs.CR cs.AI

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

4/16/2024

cs.CR cs.CL cs.HC

💬

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

4/9/2024

cs.CL