Dark Miner: Defend against unsafe generation for text-to-image diffusion models

Read original: arXiv:2409.17682 - Published 9/27/2024 by Zheling Meng, Bo Peng, Xiaochuan Jin, Yue Jiang, Jing Dong, Wei Wang, Tieniu Tan

Overview

The paper proposes a method called "Dark Miner" to defend against unsafe generation in text-to-image diffusion models.
It targets the erasure of unsafe concepts, such as violent, sexual, and shocking imagery, which the models pick up from unfiltered large-scale training data.
The method is a recurring three-stage process of mining, verifying, and circumventing that greedily mines the embeddings with the highest probability of generating unsafe concepts.

Plain English Explanation

Text-to-image models are powerful tools that can generate images from textual descriptions. However, there is a risk that these models could be used to create inappropriate or harmful content, such as sexually explicit images or violent scenes.

The researchers developed a method called Dark Miner to address this issue. The key idea is to erase unsafe concepts from the model itself, while still allowing it to create a wide range of other images.

Earlier erasure methods mostly lower the chance of unsafe output for texts that contain unsafe descriptions. According to the paper, this leaves a gap: prompts that were not seen during training, especially adversarially crafted ones, can still produce unsafe images.

Dark Miner repeats three stages:

Mining: Find the embeddings with the highest probability of generating the unsafe concept.
Verifying and Circumventing: The abstract names these two stages without detailing them. Read plainly, the mined embeddings are checked, and the model is then updated so that they no longer yield the unsafe concept.

By repeating this process, Dark Miner aims to reduce unsafe generation more effectively than earlier erasure methods.

Technical Explanation

The paper re-analyzes the concept erasure task. Most existing methods modify the generation probabilities conditioned on texts that contain unsafe descriptions. The authors point out that this cannot guarantee that the total probability of unsafe generation is minimized, so safe generation is not guaranteed for texts unseen in the training phase, especially prompts from adversarial attacks.

Dark Miner addresses this with a recurring three-stage process:

Mining: The method greedily mines embeddings with maximum generation probabilities of unsafe concepts.
Verifying and Circumventing: The abstract names these stages without further detail. Together with mining, they form a loop that is repeated in order to reduce unsafe generation more effectively.

The researchers evaluate Dark Miner on two inappropriate concepts, two objects, and two styles, and compare it with 6 previous state-of-the-art methods. They report better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model's native generation capability.
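
The paper's abstract describes Dark Miner as a recurring loop of mining, verifying, and circumventing. The sketch below is a toy illustration of that loop shape only, not the authors' implementation. A small linear map `W` stands in for the diffusion model, a fixed direction `unsafe_dir` stands in for an unsafe concept, and a short list of candidate embeddings stands in for the embedding space. All function names, the threshold, and the rank-one update are assumptions made for illustration.

```python
# Toy illustration only. Nothing here comes from the paper's code:
# W, unsafe_dir, the candidates, and every function name are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unsafe_score(W, unsafe_dir, emb):
    """Stand-in for 'probability of unsafe generation':
    project the model output W @ emb onto the unsafe direction."""
    out = [dot(row, emb) for row in W]
    return dot(out, unsafe_dir)

def mine(W, unsafe_dir, candidates):
    """Mining (assumed form): greedily pick the highest-scoring candidate."""
    return max(candidates, key=lambda e: unsafe_score(W, unsafe_dir, e))

def verify(W, unsafe_dir, emb, threshold):
    """Verifying (assumed form): does the mined embedding still score as unsafe?"""
    return unsafe_score(W, unsafe_dir, emb) > threshold

def circumvent(W, unsafe_dir, emb):
    """Circumventing (assumed form): rank-one update of W so that
    emb no longer produces any output along unsafe_dir."""
    s = unsafe_score(W, unsafe_dir, emb) / (dot(emb, emb) * dot(unsafe_dir, unsafe_dir))
    return [[w - s * u * e for w, e in zip(row, emb)]
            for row, u in zip(W, unsafe_dir)]

def dark_miner_toy(W, unsafe_dir, candidates, threshold=0.05, max_rounds=50):
    """Repeat mine -> verify -> circumvent until nothing unsafe is found."""
    rounds = 0
    while rounds < max_rounds:
        emb = mine(W, unsafe_dir, candidates)
        if not verify(W, unsafe_dir, emb, threshold):
            break  # the best mined embedding is no longer unsafe
        W = circumvent(W, unsafe_dir, emb)
        rounds += 1
    return W, rounds

# Usage with made-up numbers
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
unsafe_dir = [0.0, 1.0]
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

W_new, rounds = dark_miner_toy(W, unsafe_dir, candidates)
print(rounds)
print(max(unsafe_score(W_new, unsafe_dir, e) for e in candidates))
```

In this toy setup the loop stops after two rounds, once no candidate scores above the threshold. The first row of `W`, which does not contribute to the unsafe direction, is left unchanged. How the actual method preserves the model's generation capability is not described in the abstract and is not modeled here.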

Critical Analysis

Dark Miner is a promising approach to mitigating the potential harms of text-to-image models. However, some limitations and open questions remain:

The reported experiments cover two inappropriate concepts, two objects, and two styles. Expanding the range of unsafe content types considered could be an important area for improvement.
The method is reported to achieve better results in most cases, not all, and it is tested under 4 attacks. Users who intentionally obfuscate or disguise their intent with other attacks may still find prompts that elicit unsafe content.
There may be ethical and legal considerations around concept erasure, as removing concepts from a model could potentially restrict free expression or access to information.

Addressing these concerns and continuing to refine the Dark Miner approach will be important for ensuring that text-to-image models are developed and deployed responsibly.

Conclusion

Dark Miner addresses the challenge of unsafe content generation in text-to-image diffusion models. By repeatedly mining, verifying, and circumventing the embeddings most likely to generate unsafe concepts, the approach can help steer these powerful models away from creating harmful or explicit outputs, including under adversarial prompts.

While the framework has some limitations, it demonstrates the potential for developing technical solutions to mitigate the risks associated with generative AI systems. As the field of text-to-image modeling continues to advance, frameworks like Dark Miner will be crucial for ensuring these technologies are deployed responsibly and in a way that benefits society.

This summary was produced with help from an AI and may contain inaccuracies. Check the original paper (arXiv:2409.17682) for details.

Related Papers

Dark Miner: Defend against unsafe generation for text-to-image diffusion models

Zheling Meng, Bo Peng, Xiaochuan Jin, Yue Jiang, Jing Dong, Wei Wang, Tieniu Tan

Text-to-image diffusion models have been demonstrated with unsafe generation due to unfiltered large-scale training data, such as violent, sexual, and shocking images, necessitating the erasure of unsafe concepts. Most existing methods focus on modifying the generation probabilities conditioned on the texts containing unsafe descriptions. However, they fail to guarantee safe generation for unseen texts in the training phase, especially for the prompts from adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probabilities of unsafe generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. It greedily mines embeddings with maximum generation probabilities of unsafe concepts and reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model's native generation capability. Our code will be available on GitHub.

9/27/2024

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu

Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexually explicit scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block sexually explicit content (e.g., naked) but may still be vulnerable to adversarial prompts -- inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate sexual content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate explicit visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since such unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets and large-scale user studies demonstrate SafeGen's effectiveness in mitigating sexually explicit content generation while preserving the high-fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.4% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.

9/17/2024

Towards Understanding Unsafe Video Generation

Yan Pang, Aiping Xiong, Yang Zhang, Tianhao Wang

Video generation models (VGMs) have demonstrated the capability to synthesize high-quality output. It is important to understand their potential to produce unsafe content, such as violent or terrifying videos. In this work, we provide a comprehensive understanding of unsafe video generation. First, to confirm the possibility that these models could indeed generate unsafe videos, we choose unsafe content generation prompts collected from 4chan and Lexica, and three open-source SOTA VGMs to generate unsafe videos. After filtering out duplicates and poorly generated content, we created an initial set of 2112 unsafe videos from an original pool of 5607 videos. Through clustering and thematic coding analysis of these generated videos, we identify 5 unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political. With IRB approval, we then recruit online participants to help label the generated videos. Based on the annotations submitted by 403 participants, we identified 937 unsafe videos from the initial video set. With the labeled information and the corresponding prompts, we created the first dataset of unsafe videos generated by VGMs. We then study possible defense mechanisms to prevent the generation of unsafe videos. Existing defense methods in image generation focus on filtering either input prompt or output results. We propose a new approach called Latent Variable Defense (LVD), which works within the model's internal sampling process. LVD can achieve 0.90 defense accuracy while reducing time and computing resources by 10x when sampling a large number of unsafe prompts.

7/18/2024

Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at https://latentguard.github.io/.

8/20/2024