Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective

2404.19382

Published 5/1/2024 by Xiaoxuan Han, Songlin Yang, Wei Wang, Yang Li, Jing Dong

🧪

Abstract

Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work content generation. Towards this, unlearning methods have been developed to erase these involved concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw for restoring these erased concepts. This erasure trustworthiness problem needs probe, but previous methods are sub-optimal from two perspectives: (1) Lack of transferability: Some methods operate within a white-box setting, requiring access to the unlearned model. And the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) Limited attack: The prompt-level methods struggle to restore narrow concepts from unlearned models, such as celebrity identity. Therefore, this paper aims to leverage the transferability of the adversarial attack to probe the unlearning robustness under a black-box setting. This challenging scenario assumes that the unlearning method is unknown and the unlearned model is inaccessible for optimization, requiring the attack to be capable of transferring across different unlearned models. Specifically, we employ an adversarial search strategy to search for the adversarial embedding which can transfer across different unlearned models. This strategy adopts the original Stable Diffusion model as a surrogate model to iteratively erase and search for embeddings, enabling it to find the embedding that can restore the target concept for different unlearning methods. Extensive experiments demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods and its effectiveness for different levels of concepts.

Create account to get full access

Overview

Advanced text-to-image diffusion models raise concerns about identity privacy violation, copyright infringement, and generating Not Safe For Work (NSFW) content
Unlearning methods have been developed to erase these problematic concepts from diffusion models, but they only shift the text-to-image mapping and preserve the visual content, leaving a flaw for restoring the erased concepts
This paper aims to leverage the transferability of adversarial attacks to probe the unlearning robustness under a black-box setting, where the unlearning method and model are unknown

Plain English Explanation

The rapid development of advanced text-to-image diffusion models has raised significant safety concerns. These models can be used to generate images that violate an individual's privacy, infringe on copyrights, or produce content that is not suitable for all audiences. In response, researchers have developed "unlearning" methods to remove these problematic concepts from the diffusion models. However, these unlearning techniques only shift the way the model maps text to images, while still preserving the visual content. This means that the erased concepts can still be restored, which is a major flaw in the unlearning process.

To address this issue, the researchers in this paper leverage the transferability of adversarial attacks to probe the unlearning robustness under a black-box setting, where the unlearning method and model are unknown. This is a challenging scenario, as the attack needs to be able to transfer across different unlearned models to effectively restore the erased concepts, such as celebrity identities or other narrow concepts.

Technical Explanation

The paper proposes an adversarial search strategy that uses the original Stable Diffusion model as a surrogate to iteratively erase and search for embeddings that can effectively restore the target concept across different unlearned models. This approach is designed to address the lack of transferability and limited attack capabilities of previous methods.

Through extensive experiments, the researchers demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods, as well as its effectiveness in restoring different levels of concepts, from narrow concepts to more general visual content.

Critical Analysis

The paper provides a comprehensive and rigorous investigation of the "erasure trustworthiness problem" in text-to-image diffusion models, which is a critical issue that must be addressed as these models become more advanced and widely deployed. The authors' approach of leveraging adversarial attacks to probe the unlearning robustness under a black-box setting is novel and promising.

However, the paper does not fully explore the potential long-term implications of these adversarial attacks on the development and deployment of text-to-image diffusion models. Additionally, the paper does not discuss the ethical considerations and responsibilities of researchers and developers in addressing these safety concerns.

Further research is needed to understand the broader societal impact of these models and to develop more robust and responsible unlearning methods that can withstand a wide range of adversarial attacks.

Conclusion

This paper presents a novel approach to probing the unlearning robustness of advanced text-to-image diffusion models, which is a critical issue as these models become more sophisticated and widely used. The authors' adversarial search strategy demonstrates the transferability of adversarial embeddings across different unlearning methods, highlighting the need for more robust and responsible approaches to addressing the safety concerns raised by these models. The findings of this research have important implications for the continued development and deployment of text-to-image diffusion models, and will likely spur further investigation into the complex challenges surrounding identity privacy, copyright infringement, and NSFW content generation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu

Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the potential generation of harmful content and copyright violations. The techniques of machine unlearning, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can prompt DMs post-unlearning to regenerate undesired images containing concepts (such as nudity) meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in the robust unlearning framework referred to as AdvUnlearn. However, achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn. Moreover, we identify the text encoder as a more suitable module for robustification compared to UNet, ensuring unlearning effectiveness. And the acquired text encoder can serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments to demonstrate the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn also achieves a balanced tradeoff with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Codes are available at: https://github.com/OPTML-Group/AdvUnlearn

6/18/2024

cs.CV cs.CR

📉

To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now

Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu

The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigated the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models.Through extensive benchmarking, we evaluate the robustness of five widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs. Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper contains model outputs that may be offensive in nature.

6/18/2024

cs.CV

A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models

Rui Ma, Qiang Zhou, Yizhu Jin, Daquan Zhou, Bangjun Xiao, Xiuyu Li, Yi Qu, Aishani Singh, Kurt Keutzer, Jingtong Hu, Xiaodong Xie, Zhen Dong, Shanghang Zhang, Shiji Zhou

Copyright law confers upon creators the exclusive rights to reproduce, distribute, and monetize their creative works. However, recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement. These technologies enable the unauthorized learning and replication of copyrighted content, artistic creations, and likenesses, leading to the proliferation of unregulated content. Notably, models like stable diffusion, which excel in text-to-image synthesis, heighten the risk of copyright infringement and unauthorized distribution.Machine unlearning, which seeks to eradicate the influence of specific data or concepts from machine learning models, emerges as a promising solution by eliminating the enquote{copyright memories} ingrained in diffusion models. Yet, the absence of comprehensive large-scale datasets and standardized benchmarks for evaluating the efficacy of unlearning techniques in the copyright protection scenarios impedes the development of more effective unlearning methods. To address this gap, we introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset. This dataset encompasses anchor images, associated prompts, and images synthesized by text-to-image models. Additionally, we have developed a mixed metric based on semantic and style information, validated through both human and artist assessments, to gauge the effectiveness of unlearning approaches. Our dataset, benchmark library, and evaluation metrics will be made publicly available to foster future research and practical applications (https://rmpku.github.io/CPDM-page/, website / http://149.104.22.83/unlearning.tar.gz, dataset).

6/24/2024

cs.CV

📊

Unlearnable Examples for Diffusion Models: Protect Data from Unauthorized Exploitation

Zhengyue Zhao, Jinhao Duan, Xing Hu, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen

Diffusion models have demonstrated remarkable performance in image generation tasks, paving the way for powerful AIGC applications. However, these widely-used generative models can also raise security and privacy concerns, such as copyright infringement, and sensitive data leakage. To tackle these issues, we propose a method, Unlearnable Diffusion Perturbation, to safeguard images from unauthorized exploitation. Our approach involves designing an algorithm to generate sample-wise perturbation noise for each image to be protected. This imperceptible protective noise makes the data almost unlearnable for diffusion models, i.e., diffusion models trained or fine-tuned on the protected data cannot generate high-quality and diverse images related to the protected training data. Theoretically, we frame this as a max-min optimization problem and introduce EUDP, a noise scheduler-based method to enhance the effectiveness of the protective noise. We evaluate our methods on both Denoising Diffusion Probabilistic Model and Latent Diffusion Models, demonstrating that training diffusion models on the protected data lead to a significant reduction in the quality of the generated images. Especially, the experimental results on Stable Diffusion demonstrate that our method effectively safeguards images from being used to train Diffusion Models in various tasks, such as training specific objects and styles. This achievement holds significant importance in real-world scenarios, as it contributes to the protection of privacy and copyright against AI-generated content.

6/26/2024

cs.CV