DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Read original: arXiv:2408.11071 - Published 8/22/2024 by Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Qi Guo, Kaidi Xu

Overview

The paper presents DiffZOO, a black-box attack on text-to-image generative models based on zeroth-order optimization.
The attack is "purely query-based," meaning it doesn't require access to the model's internals.
The paper includes a warning that the content may be offensive or upsetting.

Plain English Explanation

The paper describes a new technique called DiffZOO that can be used to attack text-to-image generative models in a "black-box" manner. This means the technique doesn't require any information about the internal workings of the model - it only needs to be able to query the model and get outputs.

The key idea behind DiffZOO is to use a mathematical optimization method called "zeroth-order optimization" to systematically find input prompts that will cause the model to generate images that are undesirable or harmful. This is a form of "red-teaming" - proactively testing the model's security and robustness.

The paper warns that the generated content may be offensive or upsetting, as the authors are trying to find the limits of the model's capabilities, including potentially generating harmful outputs.

Technical Explanation

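The paper presents DiffZOO, a black-box attack on text-to-image generative models using zeroth-order optimization. Zeroth-order optimization estimates gradients from function evaluations alone, so it can optimize an objective whose internals and true gradients are unavailable. According to the abstract, DiffZOO applies it to obtain gradient approximations and uses both C-PRV and D-PRV to improve attack prompts within the discrete prompt domain.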
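To make the idea of zeroth-order optimization concrete, here is a minimal sketch of a two-point random-direction gradient estimator applied to a toy function. It is a generic textbook construction, not the paper's algorithm: the names `zo_gradient`, `zo_minimize`, and `toy_objective`, as well as the step size, smoothing radius, and number of directions, are assumptions chosen for illustration, and the objective is a simple quadratic rather than anything involving a generative model.

```python
import numpy as np


def zo_gradient(f, x, num_dirs=20, mu=1e-3, rng=None):
    """Estimate the gradient of f at x using only evaluations of f.

    For each random direction u, a central finite difference along u
    approximates the directional derivative, which is then projected
    back onto u. Averaging over directions reduces the variance.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        directional = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        grad += directional * u
    return grad / num_dirs


def toy_objective(x):
    # Hypothetical stand-in for a black box: the optimizer only ever
    # sees the returned number, never this formula or its gradient.
    return float(np.sum((x - 3.0) ** 2))


def zo_minimize(f, x0, steps=200, lr=0.05, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x -= lr * zo_gradient(f, x, rng=rng)
    return x


if __name__ == "__main__":
    x_final = zo_minimize(toy_objective, np.zeros(4))
    print(x_final, toy_objective(x_final))
```

Each descent step here costs `2 * num_dirs` function evaluations, which is the basic trade-off of the approach: no gradient access is needed, but every update is paid for in queries.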

The key innovation of DiffZOO is that it can attack text-to-image models in a purely query-based manner - it doesn't require knowledge of the model's internals, only the ability to provide input prompts and observe the generated outputs. This makes it a powerful red-teaming tool for evaluating the robustness of these models.
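The "purely query-based" setting can be illustrated with a small sketch in which the optimizer sees nothing but the outputs of a wrapped function and every call is counted against a budget. This is an assumption-laden toy, not the paper's setup: `QueryOracle`, `hidden_score`, and `coordinate_search` are hypothetical names, the hidden function is a simple quadratic, and the search is an ordinary derivative-free coordinate search.

```python
class QueryOracle:
    """Expose a function as a black box and count every call made to it."""

    def __init__(self, fn, budget):
        self._fn = fn          # internals are never shown to the caller
        self.budget = budget   # maximum number of allowed queries
        self.queries = 0

    def __call__(self, x):
        if self.queries >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.queries += 1
        return self._fn(x)


def hidden_score(x):
    # Hypothetical black-box objective, minimized at x = [1, 1, ...].
    return sum((xi - 1.0) ** 2 for xi in x)


def coordinate_search(oracle, x0, step=0.5, rounds=10):
    """Derivative-free search: try +/- step on each coordinate and keep
    any change that lowers the score; halve the step when a full round
    brings no improvement."""
    x = list(x0)
    best = oracle(x)
    for _ in range(rounds):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                candidate = x.copy()
                candidate[i] += delta
                value = oracle(candidate)
                if value < best:
                    x, best = candidate, value
                    improved = True
                    break
        if not improved:
            step /= 2
    return x, best


if __name__ == "__main__":
    oracle = QueryOracle(hidden_score, budget=200)
    x, best = coordinate_search(oracle, [0.0, 0.0])
    print(x, best, oracle.queries)
```

Tracking the query count matters because, in a query-only setting, the number of calls is the main cost of an optimization run, and a real service may rate-limit or bill per request.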

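The authors evaluate DiffZOO against multiple safety mechanisms of T2I diffusion models and against online servers. According to the abstract, experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works. They also provide insights into the vulnerabilities of these models and discuss potential countermeasures.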

Critical Analysis

The paper provides a novel and important contribution to the field of text-to-image model security and robustness. By introducing a purely query-based black-box attack, the authors have expanded the set of tools available for red-teaming these models and understanding their failure modes.

However, the paper also raises some concerns. The authors acknowledge that the generated content may be offensive or upsetting, which highlights the potential for misuse of such techniques. Additionally, the paper does not delve into the ethical implications of this research or provide guidance on responsible disclosure and deployment.

Further research is needed to explore the broader implications of black-box attacks on text-to-image models, as well as to develop more robust defenses and responsible practices for model development and deployment.

Conclusion

The DiffZOO paper presents a novel black-box attack on text-to-image generative models using zeroth-order optimization. This technique can be a powerful tool for evaluating the security and robustness of these models, but also raises ethical concerns about the potential for misuse. Ongoing research and responsible practices will be crucial to ensuring the safe and beneficial development of text-to-image technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Qi Guo, Kaidi Xu

Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model's capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2I models.

8/22/2024

RT-Attack: Jailbreaking Text-to-Image Models via Random Token

Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Liu, Qing Guo

Recently, Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet these models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most of the previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While some black-box jailbreak attacks have been explored, these typically rely on simply replacing sensitive words, leading to suboptimal attack performance. To address this issue, we introduce a two-stage query-based black-box attack method utilizing random search. In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts. In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking and maximizing the similarity in image features between the images generated from this prompt and those produced by the target harmful prompt. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.

8/28/2024

Certified Zeroth-order Black-Box Defense with Robust UNet Denoiser

Astha Verma, A V Subramanyam, Siddhesh Bangar, Naman Lal, Rajiv Ratn Shah, Shin'ichi Satoh

Certified defense methods against adversarial perturbations have been recently investigated in the black-box setting with a zeroth-order (ZO) perspective. However, these methods suffer from high model variance with low performance on high-dimensional datasets due to the ineffective design of the denoiser and are limited in their utilization of ZO techniques. To this end, we propose a certified ZO preprocessing technique for removing adversarial perturbations from the attacked image in the black-box setting using only model queries. We propose a robust UNet denoiser (RDUNet) that ensures the robustness of black-box models trained on high-dimensional datasets. We propose a novel black-box denoised smoothing (DS) defense mechanism, ZO-RUDS, by prepending our RDUNet to the black-box model, ensuring black-box defense. We further propose ZO-AE-RUDS in which RDUNet followed by autoencoder (AE) is prepended to the black-box model. We perform extensive experiments on four classification datasets, CIFAR-10, CIFAR-10, Tiny Imagenet, STL-10, and the MNIST dataset for image reconstruction tasks. Our proposed defense methods ZO-RUDS and ZO-AE-RUDS beat SOTA with a huge margin of 35% and 9% for low-dimensional (CIFAR-10) datasets and with a margin of 20.61% and 23.51% for high-dimensional (STL-10) datasets, respectively.

7/9/2024

Adversarial Robustification via Text-to-Image Diffusion Models

Daewon Choi, Jongheon Jeong, Huiwon Jang, Jinwoo Shin

Adversarial robustness has been conventionally believed as a challenging property to encode for neural networks, requiring plenty of training data. In the recent paradigm of adopting off-the-shelf models, however, access to their training data is often infeasible or not practical, while most of such models are not originally trained concerning adversarial robustness. In this paper, we develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data. Our intuition is to view recent text-to-image diffusion models as adaptable denoisers that can be optimized to specify target tasks. Based on this, we propose: (a) to initiate a denoise-and-classify pipeline that offers provable guarantees against adversarial attacks, and (b) to leverage a few synthetic reference images generated from the text-to-image model that enables novel adaptation schemes. Our experiments show that our data-free scheme applied to the pre-trained CLIP could improve the (provable) adversarial robustness of its diverse zero-shot classification derivatives (while maintaining their accuracy), significantly surpassing prior approaches that utilize the full training data. Not only for CLIP, we also demonstrate that our framework is easily applicable for robustifying other visual classifiers efficiently.

7/29/2024