Perception-guided Jailbreak against Text-to-Image Models

Read original: arXiv:2408.10848 - Published 8/27/2024 by Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu

Perception-guided Jailbreak against Text-to-Image Models

Overview

This paper explores a technique called "perception-guided jailbreak" to bypass the content restrictions of text-to-image models.
The researchers demonstrate how perception-based prompts can be used to generate NSFW (Not-Safe-For-Work) images from ostensibly safe text-to-image models.
The findings highlight the challenges in controlling the outputs of these powerful generative models and the need for further research into robust safety and security measures.

Plain English Explanation

Text-to-image models are AI systems that can generate images based on textual descriptions. These models are often trained to avoid producing inappropriate or harmful content, a process known as "jailbreaking."

The researchers in this paper discovered a way to circumvent these jailbreaking measures by using "perception-based" prompts. Instead of directly asking the model to generate NSFW content, they found that they could prompt the model with descriptions that would lead it to generate the desired (but restricted) content.

For example, rather than asking the model to "create a nude image," they might prompt it with something like "a person with no clothes on." The model would then interpret this prompt through its perception of the world and generate an NSFW image, despite the model's restrictions.

This technique highlights a flaw in the way many text-to-image models are currently designed. While they may be able to detect and block direct requests for inappropriate content, they can still be manipulated to generate that content indirectly. Addressing this challenge will require more sophisticated safety measures and a deeper understanding of how these models perceive and interpret language.

Technical Explanation

The paper presents a "perception-guided jailbreak" technique to bypass the safety measures of text-to-image models. The core idea is to leverage the model's perception of the world, as encoded in its training data and parameters, to generate NSFW content without directly asking for it.

The researchers conducted experiments with several state-of-the-art text-to-image models, including DALL-E 2, Midjourney, and Stable Diffusion. They crafted a set of "perception-based" prompts that, while not directly requesting NSFW content, would lead the models to generate such images.

For example, prompts like "a person with no clothes on" or "a sensual photograph of a naked body" were found to be effective in circumventing the models' jailbreaking mechanisms. The researchers hypothesize that these prompts align with the models' internal representations of the world, causing them to interpret the request as a legitimate one and generate the corresponding NSFW content.

The paper also explores the role of prompt engineering, showing how small changes in wording can significantly impact the generated images. Additionally, the researchers investigate the transferability of their perception-guided jailbreak approach, demonstrating its effectiveness across multiple text-to-image models.

Critical Analysis

The paper raises important concerns about the limitations of current text-to-image model safety measures. While these models may be able to detect and block direct requests for inappropriate content, the researchers have shown that they can still be manipulated to generate such content indirectly.

One potential limitation of the study is the reliance on a relatively small set of perception-based prompts. It's possible that a more comprehensive exploration of prompt variations could lead to even more effective jailbreaking techniques. Additionally, the paper does not address the potential for adversarial prompts that could exploit other vulnerabilities in the models.

Another area for further research is the impact of model architecture, training data, and fine-tuning on the effectiveness of perception-guided jailbreaking. It's possible that some models may be more resilient to these types of attacks than others, and understanding the factors that contribute to this resilience could inform the development of more robust safety measures.

Overall, the findings of this paper highlight the need for continued research and innovation in the development of secure and trustworthy text-to-image models. As these models become more powerful and ubiquitous, ensuring that they can be reliably controlled and constrained will be critical to their responsible deployment and use.

Conclusion

This paper presents a novel "perception-guided jailbreak" technique that can be used to bypass the content restrictions of text-to-image models. The researchers demonstrate how carefully crafted prompts, which align with the models' internal representations of the world, can lead to the generation of NSFW content despite the models' built-in safety measures.

The findings underscore the challenges in developing effective and comprehensive jailbreaking mechanisms for these powerful generative models. As text-to-image systems continue to advance, addressing vulnerabilities like the one highlighted in this paper will be crucial to ensuring their safe and responsible deployment.

The insights from this research can inform the development of more robust safety and security measures for text-to-image models, as well as the broader field of AI safety and alignment. Continued exploration of these issues will be essential as generative AI systems become increasingly pervasive and influential in our daily lives.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Perception-guided Jailbreak against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu

In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

8/27/2024

RT-Attack: Jailbreaking Text-to-Image Models via Random Token

Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo

Recently, Text-to-Image(T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work(NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most of the previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While some black-box jailbreak attacks have been explored, these typically rely on simply replacing sensitive words, leading to suboptimal attack performance. To address this issue, we introduce a two-stage query-based black-box attack method utilizing random search. In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts. In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking and maximizing the similarity in image features between the images generated from this prompt and those produced by the target harmful prompt. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.

8/28/2024

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang

Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious contents by circumventing the alignment in LLMs, which are often referred to as jailbreaking. However, most of the previous works only focused on the text-based jailbreaking in LLMs, and the jailbreaking of the text-to-image (T2I) generation system has been relatively overlooked. In this paper, we first evaluate the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. Then, we further propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts to maximize degree of violation from the generated images without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks the ChatGPT with 11.0% block rate, making it generate copyrighted contents in 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but found that they were inadequate, which suggests the necessity of stronger defense mechanisms.

5/29/2024

Jailbreaking Text-to-Image Models with LLM-Based Agents

Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, Shanqing Guo

Recent advancements have significantly improved automated task-solving capabilities using autonomous agents powered by large language models (LLMs). However, most LLM-based agents focus on dialogue, programming, or specialized domains, leaving their potential for addressing generative AI safety tasks largely unexplored. In this paper, we propose Atlas, an advanced LLM-based multi-agent framework targeting generative AI models, specifically focusing on jailbreak attacks against text-to-image (T2I) models with built-in safety filters. Atlas consists of two agents, namely the mutation agent and the selection agent, each comprising four key modules: a vision-language model (VLM) or LLM brain, planning, memory, and tool usage. The mutation agent uses its VLM brain to determine whether a prompt triggers the T2I model's safety filter. It then collaborates iteratively with the LLM brain of the selection agent to generate new candidate jailbreak prompts with the highest potential to bypass the filter. In addition to multi-agent communication, we leverage in-context learning (ICL) memory mechanisms and the chain-of-thought (COT) approach to learn from past successes and failures, thereby enhancing Atlas's performance. Our evaluation demonstrates that Atlas successfully jailbreaks several state-of-the-art T2I models equipped with multi-modal safety filters in a black-box setting. Additionally, Atlas outperforms existing methods in both query efficiency and the quality of generated images. This work convincingly demonstrates the successful application of LLM-based agents in studying the safety vulnerabilities of popular text-to-image generation models. We urge the community to consider advanced techniques like ours in response to the rapidly evolving text-to-image generation field.

9/10/2024