RT-Attack: Jailbreaking Text-to-Image Models via Random Token

Read original: arXiv:2408.13896 - Published 8/28/2024 by Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo

Overview

This paper presents RT-Attack, a two-stage, query-based black-box method for attacking text-to-image (T2I) models that uses random search over prompt tokens.
The goal is to "jailbreak" the T2I model, causing it to generate inappropriate or Not-Safe-For-Work (NSFW) images that its safeguards are meant to block.
The approach is evaluated against prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.

Plain English Explanation

The paper introduces a technique called RT-Attack that can be used to bypass the safety restrictions of text-to-image (T2I) models. T2I models are AI systems that generate images from text descriptions.

The key idea behind RT-Attack is to search, by repeatedly trying random token changes, for a text prompt that safety filters do not flag but that still steers the T2I model toward the image an attacker wants.

For example, a filter might block a prompt that asks for disallowed content directly. RT-Attack looks for a different sequence of tokens that slips past the filter while leading the model to produce an image similar to the one the blocked prompt would have produced. This is like jailbreaking the T2I model: the safety and content restrictions built into the system are circumvented without any access to the model's internals.

The paper evaluates RT-Attack against prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models, and reports that it is effective against them. This highlights the potential security risks of these models and the need for more robust safety mechanisms.

Technical Explanation

The paper introduces an attack method called RT-Attack, which targets text-to-image (T2I) generative models. It is a black-box attack: it only queries the target system and does not need the model's gradients, which are often unavailable in real-world settings and which some defenses deliberately obscure through gradient masking.

The attack runs in two stages, both driven by random search. In the first stage, it builds a preliminary adversarial prompt by maximizing the semantic similarity between that prompt and the target harmful prompt. In the second stage, it refines this prompt so that it gets past the safety mechanisms while maximizing the similarity in image features between the images it produces and those produced by the target harmful prompt.

The authors report experiments against prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models, and present the method as more effective than earlier black-box attacks that rely on simply replacing sensitive words.
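
As a rough illustration of the random-search idea behind the first stage described in the paper's abstract (maximizing similarity between a candidate prompt and a target prompt), the sketch below uses a toy setup with a benign target. Everything in it is an assumption made for illustration and is not taken from the paper: the character-trigram `embed` function stands in for a real text encoder, and the vocabulary, prompt length, step count, and the `random_search` helper are hypothetical. It involves no T2I model and no safety checker.

```python
import math
import random
from collections import Counter


def embed(text):
    """Toy stand-in for a text encoder (assumption): character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def random_search(target, vocab, length=6, steps=500, seed=0):
    """Hill-climb over token sequences: keep a random swap only if it scores higher."""
    rng = random.Random(seed)
    target_vec = embed(target)
    tokens = [rng.choice(vocab) for _ in range(length)]
    start = best = cosine(embed(" ".join(tokens)), target_vec)
    for _ in range(steps):
        candidate = list(tokens)
        # Replace one randomly chosen position with a random vocabulary token.
        candidate[rng.randrange(length)] = rng.choice(vocab)
        score = cosine(embed(" ".join(candidate)), target_vec)
        if score > best:
            tokens, best = candidate, score
    return " ".join(tokens), start, best


# Hypothetical vocabulary and benign target, chosen only for the demo.
VOCAB = ["red", "blue", "bicycle", "wall", "leaning", "tree",
         "cat", "on", "a", "river", "stone", "old"]
prompt, start_score, final_score = random_search(
    "a red bicycle leaning on a wall", VOCAB
)
print(prompt, round(start_score, 3), round(final_score, 3))
```

Because a swap is accepted only when it raises the score, the similarity never decreases across iterations, which is what makes this kind of search usable without gradients. The second stage described in the abstract would replace the text-similarity score with one based on image features, and that requires querying an actual image model.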

Critical Analysis

The paper raises important concerns about the security and robustness of current T2I models. The RT-Attack method demonstrates that these models can be "jailbroken" using only queries and random search over prompt tokens, bypassing content filtering and safety mechanisms.

While the authors acknowledge that T2I models have made significant progress in recent years, the vulnerability exposed by RT-Attack highlights the need for more robust safety mechanisms in these systems. Malicious actors could exploit this weakness to generate harmful or inappropriate content, which could have serious consequences.

The RT-Attack approach itself also carries risks, such as the potential for misuse or unintended consequences. Further research is needed to understand the broader societal impacts of this type of attack and to develop more effective safeguards against such threats.

Conclusion

The RT-Attack paper presents a novel method for bypassing the content restrictions of text-to-image (T2I) generative models. By using random search to craft adversarial prompts, the authors demonstrate that it is possible to get past these models' safety mechanisms without access to their gradients.

This research highlights the need for more robust safety mechanisms and security measures in T2I systems, as the attack shows that current safeguards can be circumvented. As these models become more widely adopted, it will be crucial for developers and researchers to address these security concerns and ensure the responsible and ethical deployment of these powerful technologies.

Related Papers

RT-Attack: Jailbreaking Text-to-Image Models via Random Token

Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo

Recently, Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While some black-box jailbreak attacks have been explored, these typically rely on simply replacing sensitive words, leading to suboptimal attack performance. To address this issue, we introduce a two-stage query-based black-box attack method utilizing random search. In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts. In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking and maximizing the similarity in image features between the images generated from this prompt and those produced by the target harmful prompt. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.

8/28/2024

Perception-guided Jailbreak against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, Yang Liu

In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

8/27/2024

Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang

Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious contents by circumventing the alignment in LLMs, which are often referred to as jailbreaking. However, most of the previous works only focused on the text-based jailbreaking in LLMs, and the jailbreaking of the text-to-image (T2I) generation system has been relatively overlooked. In this paper, we first evaluate the safety of the commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. Then, we further propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts to maximize degree of violation from the generated images without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks the ChatGPT with 11.0% block rate, making it generate copyrighted contents in 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but found that they were inadequate, which suggests the necessity of stronger defense mechanisms.

5/29/2024

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Jiachen Ma, Anda Cao, Zhiqing Xiao, Yijiang Li, Jie Zhang, Chao Ye, Junbo Zhao

Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, and misleading or Not-Safe-for-Work (NSFW) images. Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process. In this work, we investigate a more practical and universal attack that does not require the presence of a target model and demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images. We present the Jailbreaking Prompt Attack (JPA). JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT. Subsequently, a prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space. We further introduce a soft assignment with gradient masking technique that allows us to perform gradient ascent in the discrete vocabulary space. We perform extensive experiments with open-sourced T2I models, e.g. stable-diffusion-v1-4 and closed-sourced online services, e.g. DALLE2, Midjourney with black-box safety checkers. Results show that (1) JPA bypasses both text and image safety checkers (2) while preserving high semantic alignment with the target prompt. (3) JPA demonstrates a much faster speed than previous methods and can be executed in a fully automated manner. These merits render it a valuable tool for robustness evaluation in future text-to-image generation research.

9/5/2024