SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Read original: arXiv:2404.06666 - Published 9/17/2024 by Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu


Overview

This paper proposes "SafeGen", a framework to mitigate unsafe content generation, specifically sexually explicit content, in text-to-image models.
The key idea is to eliminate explicit visual representations from the model itself, regardless of the text input, rather than to filter prompts.
Because the approach is text-agnostic, the authors argue that it resists adversarial prompts, i.e., inputs that appear innocent but are ill-intended.

Plain English Explanation

Text-to-image models, like DALL-E and Stable Diffusion, are powerful tools that can create images from textual descriptions. However, these models can also be used to generate harmful or unethical content, such as explicit or violent imagery.

The SafeGen framework aims to address this issue, focusing on sexually explicit imagery. Many existing safeguards inspect the prompt or the generated output and block what looks unsafe, but prompts that look innocent can slip past them. SafeGen instead changes the model so that explicit visual representations are removed from within, whatever the prompt says.

Because the safeguard does not depend on the wording of the prompt, a cleverly disguised prompt should not bypass it. The aim is to make text-to-image models safer while preserving image quality for benign prompts.

Technical Explanation

According to the paper's abstract, SafeGen differs from earlier countermeasures in where it intervenes:

Existing countermeasures: These mostly filter inappropriate inputs and outputs (for example, with a pre-defined list of banned words or phrases) or suppress improper text embeddings. They can block overtly explicit prompts but may remain vulnerable to adversarial prompts, i.e., inputs that appear innocent but are ill-intended.
SafeGen: Rather than inspecting the prompt, SafeGen eliminates explicit visual representations from the model itself, regardless of the text input. Because the unsafe visual representations are obstructed from within, the model is resistant to adversarial prompts.
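
The banned-word style of input filtering mentioned above can be sketched in a few lines, which also shows why it misses prompts that avoid the listed words. This is a minimal illustration, not code from the paper: the word list, the function name is_blocked, and the example prompts are all assumptions made for this sketch.

```python
import re

# Hypothetical blocklist for illustration only; not taken from the paper.
BANNED_WORDS = {"naked", "nude"}


def is_blocked(prompt: str, banned: set[str] = BANNED_WORDS) -> bool:
    """Return True if any banned word appears as a whole word in the prompt."""
    # Lowercase the prompt and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return any(token in banned for token in tokens)


# An overtly explicit prompt contains a listed word and is caught.
print(is_blocked("a naked person on a beach"))  # True

# A rephrased prompt contains no listed word and passes the filter.
print(is_blocked("a person wearing nothing at all"))  # False
```

The second prompt passes because the filter only sees words, not intent. A text-agnostic defense does not rely on this check at all.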

According to the abstract, the authors evaluate SafeGen with extensive experiments on four datasets and large-scale user studies. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.4% sexual content removal performance, while preserving the fidelity of benign images. The authors also construct a benchmark of adversarial prompts for developing and evaluating future anti-NSFW-generation methods.
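
How much unsafe content a mitigation removes can be expressed as a removal rate: the share of unsafe detections that disappear once the mitigation is applied to the same prompts. The sketch below is one plausible way to compute such a figure. The formula, the function name removal_rate, and the counts are assumptions for illustration, not the paper's metric definition or data.

```python
def removal_rate(unsafe_before: int, unsafe_after: int) -> float:
    """Fraction of unsafe detections eliminated by a mitigation.

    unsafe_before: detections in images from the unmodified model.
    unsafe_after:  detections in images from the mitigated model,
                   generated from the same prompts.
    """
    if unsafe_before <= 0:
        raise ValueError("unsafe_before must be positive")
    return (unsafe_before - unsafe_after) / unsafe_before


# Made-up counts chosen only to illustrate the arithmetic.
rate = removal_rate(unsafe_before=200, unsafe_after=10)
print(f"{rate:.1%}")  # 95.0%
```

A rate like this only measures removal. It says nothing about whether benign images keep their quality, which has to be checked separately.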

Critical Analysis

The SafeGen framework represents an important step towards addressing the safety and ethical concerns associated with text-to-image models. By removing unsafe visual representations from within the model rather than relying on prompt inspection, the framework helps to mitigate the risk of these models being used to generate harmful content, including through adversarial prompts.

However, it's important to note that the framework is not a silver bullet. The reported 99.4% removal performance still leaves some explicit content unmitigated, and the abstract focuses on sexually explicit content rather than other unsafe categories such as violence. Additionally, the framework does not address other potential ethical issues, such as the representation of marginalized groups or the propagation of biases in the generated images.

Further research is needed to address these challenges and to explore other approaches to ensuring the responsible development and deployment of text-to-image models. This could include investigating techniques for detecting unauthorized data usage in the models, or developing more comprehensive frameworks for responsible generative AI.

Conclusion

The SafeGen framework represents an important step towards mitigating the risks associated with unsafe content generation in text-to-image models. By eliminating explicit visual representations from the model regardless of the text input, the framework makes these models more resistant to adversarial prompts while preserving the fidelity of benign images. While the framework is not a complete solution, it demonstrates the potential for incorporating safety and ethical considerations into the development of powerful AI tools like text-to-image models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu

Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexually explicit scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block sexually explicit content (e.g., naked) but may still be vulnerable to adversarial prompts -- inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate sexual content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate explicit visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since such unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets and large-scale user studies demonstrate SafeGen's effectiveness in mitigating sexually explicit content generation while preserving the high-fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.4% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.

9/17/2024

Dark Miner: Defend against unsafe generation for text-to-image diffusion models

Zheling Meng, Bo Peng, Xiaochuan Jin, Yue Jiang, Jing Dong, Wei Wang, Tieniu Tan

Text-to-image diffusion models have been demonstrated with unsafe generation due to unfiltered large-scale training data, such as violent, sexual, and shocking images, necessitating the erasure of unsafe concepts. Most existing methods focus on modifying the generation probabilities conditioned on the texts containing unsafe descriptions. However, they fail to guarantee safe generation for unseen texts in the training phase, especially for the prompts from adversarial attacks. In this paper, we re-analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probabilities of unsafe generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. It greedily mines embeddings with maximum generation probabilities of unsafe concepts and reduces unsafe generation more effectively. In the experiments, we evaluate its performance on two inappropriate concepts, two objects, and two styles. Compared with 6 previous state-of-the-art methods, our method achieves better erasure and defense results in most cases, especially under 4 state-of-the-art attacks, while preserving the model's native generation capability. Our code will be available on GitHub.

9/27/2024

Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at https://latentguard.github.io/.

8/20/2024

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models

Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee

Recent advancements in text-to-image (T2I) models have greatly benefited from large-scale datasets, but they also pose significant risks due to the potential generation of unsafe content. To mitigate this issue, researchers have developed unlearning techniques to remove the model's ability to generate potentially harmful content. However, these methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images. In this paper, we propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models while preserving their performance on unrelated topics. DUO employs a preference optimization approach using curated paired image data, ensuring that the model learns to remove unsafe visual concepts while retaining unrelated features. Furthermore, we introduce an output-preserving regularization term to maintain the model's generative capabilities on safe content. Extensive experiments demonstrate that DUO can robustly defend against various state-of-the-art red teaming methods without significant performance degradation on unrelated topics, as measured by FID and CLIP scores. Our work contributes to the development of safer and more reliable T2I models, paving the way for their responsible deployment in both closed-source and open-source scenarios.

8/1/2024