SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

2404.06666

Published 4/11/2024 by Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu

Abstract

Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block explicit NSFW-related content (e.g., naked or sexy) but may still be vulnerable to adversarial prompts, i.e., inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate unsafe visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets demonstrate SafeGen's effectiveness in mitigating unsafe content generation while preserving the high fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.1% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.

Overview

This paper proposes "SafeGen", a framework to mitigate unsafe content generation in text-to-image models.
The key idea is to eliminate unsafe visual representations from the model itself, regardless of the text input, rather than to filter or rewrite prompts.
The authors evaluate SafeGen on four datasets against eight state-of-the-art baselines and report 99.1% sexual content removal while preserving the fidelity of benign images.

Plain English Explanation

Text-to-image models, like DALL-E and Stable Diffusion, are powerful tools that can create images from textual descriptions. However, these models can also be used to generate harmful or unethical content, such as explicit or violent imagery.

The SafeGen framework aims to address this issue from inside the model rather than at its inputs. Existing defenses mostly filter inappropriate prompts and outputs or suppress improper text embeddings. These can block explicit requests, but a prompt that looks innocent while being ill-intended can still slip through.

SafeGen instead removes the unsafe visual representations that the model relies on to draw such content, so the defense no longer depends on how the prompt is worded. Because the change is text-agnostic, adversarial prompts have little to exploit, while benign prompts should still produce high-quality images.

Technical Explanation

Based on the abstract, the SafeGen framework rests on two main ideas:

Text-agnostic mitigation: Rather than classifying or rewriting prompts, SafeGen targets the unsafe visual representations inside the text-to-image model, so the defense applies regardless of the text input. This distinguishes it from input and output filters and from methods that suppress improper text embeddings, which can block explicit NSFW-related terms (e.g., naked or sexy) but remain vulnerable to adversarial prompts that appear innocent.
Benign fidelity preservation: The mitigation is intended to leave generation for benign prompts intact, so the authors measure the fidelity of benign images alongside how much unsafe content is removed.
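The weakness that the abstract attributes to text-based input filtering can be illustrated with a toy keyword blocklist. This is a minimal sketch and not the paper's method or any baseline's implementation: the blocklist contents, the `is_blocked` function, and both example prompts are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical blocklist; the first two terms echo the abstract's examples.
BLOCKLIST = {"naked", "sexy", "nude"}


def is_blocked(prompt: str) -> bool:
    """Return True if any blocklisted word appears as a whole token."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return any(token in BLOCKLIST for token in tokens)


# An explicit prompt contains a blocklisted word and is caught.
explicit_prompt = "a naked person on a beach"

# Invented stand-in for an adversarial prompt: the intent is similar,
# but no blocklisted word appears, so the filter lets it through.
evasive_prompt = "a person on a beach wearing nothing at all"

print(is_blocked(explicit_prompt))  # True
print(is_blocked(evasive_prompt))   # False
```

Any defense keyed only to the wording of the prompt shares this limitation to some degree, which is the motivation the abstract gives for making the mitigation text-agnostic.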

The authors evaluate SafeGen on four datasets and compare it against eight state-of-the-art baseline methods. They report that SafeGen outperforms these baselines and achieves 99.1% sexual content removal while preserving the high fidelity of benign images. They also construct a benchmark of adversarial prompts intended to support future development and evaluation of anti-NSFW-generation methods.
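The abstract reports a removal percentage but this summary does not give its exact definition. One plausible reading, assumed here, is the relative reduction in unsafe outputs compared with an unprotected model on the same prompts. The function name `removal_rate` and the counts below are hypothetical and are not results from the paper.

```python
def removal_rate(unsafe_before: int, unsafe_after: int) -> float:
    """Relative reduction in unsafe outputs (assumed metric definition).

    unsafe_before: unsafe images produced by the unprotected model.
    unsafe_after:  unsafe images produced by the protected model
                   on the same prompts.
    """
    if unsafe_before <= 0:
        raise ValueError("unsafe_before must be positive")
    if unsafe_after < 0:
        raise ValueError("unsafe_after must be non-negative")
    return 1.0 - unsafe_after / unsafe_before


# Hypothetical counts for illustration only.
print(f"{removal_rate(200, 10):.1%}")  # 95.0%
```

A metric of this kind only captures how much unsafe content is removed, so it needs to be read together with a separate fidelity measure on benign prompts.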

Critical Analysis

The SafeGen framework represents an important step towards addressing the safety and ethical concerns associated with text-to-image models. By obstructing unsafe visual representations from within the model, rather than relying on prompt or output filters, the framework helps to mitigate the risk of these models being used to generate harmful content, including through adversarial prompts.

However, it's important to note that the framework is not a silver bullet. The headline result in the abstract, 99.1% removal, concerns sexual content, so it should not be assumed to carry over to other categories of unsafe imagery such as violence. Removing visual representations from within the model may also affect some benign generations, which is why the fidelity of benign images is reported alongside removal performance. Additionally, the framework does not address other potential ethical issues, such as the representation of marginalized groups or the propagation of biases in the generated images.

Further research is needed to address these challenges and to explore other approaches to ensuring the responsible development and deployment of text-to-image models. This could include investigating techniques for detecting unauthorized data usage in the models, or developing more comprehensive frameworks for responsible generative AI.

Conclusion

The SafeGen framework represents an important step towards mitigating the risks associated with unsafe content generation in text-to-image models. By eliminating unsafe visual representations from the model in a text-agnostic manner, the framework makes these models more resistant to both explicit and adversarial prompts. While the framework is not a complete solution, it demonstrates the potential for incorporating safety and ethical considerations into the development of powerful AI tools like text-to-image models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines. Code and data will be shared at https://github.com/rt219/LatentGuard.

4/15/2024

cs.CV cs.AI cs.LG

Universal Prompt Optimizer for Safe Text-to-Image Generation

Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang

Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe input to generate unsafe content like sexual, harassment and illegal-activity images. Existing studies based on image checker, model fine-tuning and embedding blocking are impractical in real-world applications. Hence, we propose the first universal prompt optimizer for safe T2I (POSI) generation in black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs by GPT-3.5 Turbo. To guide the optimizer to have the ability of converting toxic prompt to clean prompt while preserving semantic information, we design a novel reward function measuring toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models in generating inappropriate images, with no significant impact on text alignment. It is also flexible to be combined with methods to achieve better performance. Our code is available at https://github.com/wzongyu/POSI.

5/21/2024

cs.CV cs.CL

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever toxic linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.

4/15/2024

cs.CV cs.AI cs.CL cs.MM

Harm Amplification in Text-to-Image Models

Susan Hao, Renee Shelby, Yuchi Liu, Hansa Srinivasan, Mukul Bhutani, Burcu Karagol Ayan, Ryan Poplin, Shivani Poddar, Sarah Laszlo

Text-to-image (T2I) models have emerged as a significant advancement in generative AI; however, there exist safety concerns regarding their potential to produce harmful image outputs even when users input seemingly safe prompts. This phenomenon, where T2I models generate harmful representations that were not explicit in the input, poses a potentially greater risk than adversarial prompts, leaving users unintentionally exposed to harms. Our paper addresses this issue by formalizing a definition for this phenomenon which we term harm amplification. We further contribute to the field by developing a framework of methodologies to quantify harm amplification in which we consider the harm of the model output in the context of user input. We then empirically examine how to apply these different methodologies to simulate real-world deployment scenarios including a quantification of disparate impacts across genders resulting from harm amplification. Together, our work aims to offer researchers tools to comprehensively address safety challenges in T2I systems and contribute to the responsible deployment of generative AI models.

5/21/2024

cs.CY cs.AI cs.LG