Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

2406.14539

Published 6/27/2024 by Nikita Starodubcev, Mikhail Khoroshikh, Artem Babenko, Dmitry Baranchuk

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

Abstract

Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.

Create account to get full access

Overview

This paper presents a novel approach called "Invertible Consistency Distillation" for text-guided image editing.
The method aims to address the limitations of existing text-to-image generation models by enabling more precise and controllable image editing.
Key ideas include inverting a text-conditional diffusion model to enable bidirectional text-to-image and image-to-text translation, as well as leveraging consistency between these translations to improve the editing process.

Plain English Explanation

The paper describes a new way to edit images using text prompts. Existing text-to-image models can generate images from scratch based on text, but have limited control over the editing process. The proposed "Invertible Consistency Distillation" method solves this by allowing the model to not only generate images from text, but also translate images back to text.

This bidirectional translation capability enables more precise image editing. For example, if you have an image and want to change a specific detail, you can describe that detail in text, and the model will update the image accordingly. The key insight is that by ensuring consistency between the text and image translations, the model can maintain coherence during the editing process.

Technical Explanation

The paper builds on recent advances in diffusion models and text-to-image generation. It proposes an "invertible" text-conditional diffusion model that can translate text to images and vice versa. To achieve this, the model is trained using a novel "consistency distillation" objective that encourages the text-to-image and image-to-text translations to be coherent.

Experiments show that this approach outperforms existing text-guided image editing methods on tasks like object removal, attribute editing, and image inpainting. The authors also demonstrate that the model can be efficiently fine-tuned on specific editing tasks, making it a versatile and practical tool for interactive image editing.

Critical Analysis

The paper presents a compelling and technically sophisticated solution to the challenge of text-guided image editing. By enabling bidirectional translation between text and images, the method provides a level of control and precision that overcomes the limitations of previous approaches.

However, the paper does not address potential biases or safety concerns that may arise from such a powerful text-to-image editing tool. There is also limited discussion of the computational and memory requirements of the model, which could be a practical concern for real-world deployment.

Additionally, the paper focuses primarily on the technical aspects of the method and does not explore the broader implications or societal impact of text-guided image editing. Further research may be needed to understand the ethical considerations and potential misuse cases of this technology.

Conclusion

The "Invertible Consistency Distillation" approach represents a significant advancement in the field of text-guided image editing. By leveraging the strengths of diffusion models and introducing a novel consistency-based training objective, the method enables bidirectional translation between text and images, leading to more precise and controllable editing capabilities.

While the technical merits of the research are strong, the paper could be strengthened by a more thorough discussion of the ethical and practical considerations surrounding this technology. As text-to-image models continue to evolve, it will be crucial to address these important issues and ensure that the technology is developed and deployed responsibly.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Plug-and-Play Diffusion Distillation

Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte, Wei-An Lin, Hui Qu, Mingi Kwon, Ratheesh Kalarot

Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and only requires 1% trainable parameters of the base model. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this plug-and-play functionality drastically improves inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve a comparable FID score to the teacher with as few as 8 to 16 steps.

6/17/2024

cs.CV

SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

Hongjian Liu, Qingsong Xie, Zhijie Deng, Chen Chen, Shixiang Tang, Fueyang Fu, Zheng-jun Zha, Haonan Lu

The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality generations can be achieved with just 1-2 sampling steps, and further improvements can be obtained by adding additional steps. In contrast to vanilla consistency distillation (CD) which distills the ordinary differential equation solvers-based sampling process of a pretrained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the sample quality with rare sampling steps. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID (Frechet Inceptio Distance) of 22.1, surpassing that (23.4) of the 1-step InstaFlow (Liu et al., 2023) and matching that of 4-step UFOGen (Xue et al., 2023b). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation (Luo et al., 2023a), with up to 16% improvement in a qualified metric. The code and checkpoints are coming soon.

4/16/2024

cs.CV

EdgeFusion: On-Device Text-to-Image Generation

Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim

The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

4/19/2024

cs.LG cs.AI cs.CV

📉

Distilling Diffusion Models into Conditional GANs

Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park

We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.

6/17/2024

cs.CV cs.GR cs.LG