Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

2405.05224

Published 5/9/2024 by Jonas Kohler, Albert Pumarola, Edgar Schonfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, Ali Thabet

cs.CV


Abstract

Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.


Overview

Diffusion models are powerful generative models, but their inference can be computationally expensive.
Existing methods to accelerate diffusion models often compromise image quality or fail when working with complex conditions in a low-step regime.
This paper proposes a novel distillation framework to enable high-fidelity, diverse sample generation using just one to three steps.

Plain English Explanation

Diffusion models are a type of machine learning model that can generate new images, text, or other data by learning from examples. During training, noise is gradually added to example data and the model learns to reverse that process. To generate a new sample, the model starts from noise and removes it over many denoising steps. Each step requires a full pass through the network, so generation can be slow and computationally intensive.
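To make the cost argument concrete, here is a toy one-dimensional sketch. It is not taken from the paper: the cosine-style schedule, the function names (`alpha_bar`, `add_noise`, `sample`), and the stand-in denoiser are all illustrative assumptions. It only shows that noise is blended into clean data during training and that sampling calls the network once per step, so cost grows linearly with the number of steps.

```python
import math
import random


def alpha_bar(t: int, num_steps: int) -> float:
    """Assumed cosine-style signal level: 1 at t=0 (clean), ~0 at t=num_steps (noise)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def add_noise(x0: float, t: int, num_steps: int, rng: random.Random) -> float:
    """Forward process: blend the clean value x0 with Gaussian noise at level t."""
    a = alpha_bar(t, num_steps)
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps


def toy_denoise_step(x: float, t: int) -> float:
    """Stand-in for one network pass; a real model would be a large neural network."""
    return 0.5 * x


def sample(denoise_step, num_steps: int, rng: random.Random):
    """Reverse process: start from noise and apply one denoiser call per step."""
    x = rng.gauss(0.0, 1.0)
    calls = 0
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        calls += 1  # each step costs one full network evaluation
    return x, calls


if __name__ == "__main__":
    rng = random.Random(0)
    for steps in (50, 3, 1):
        _, calls = sample(toy_denoise_step, steps, rng)
        print(f"{steps} sampling steps -> {calls} network calls")
```

Reducing a sampler from tens of steps to one to three steps therefore cuts the number of network evaluations by roughly the same factor, which is the saving the paper targets.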

This paper introduces a new approach to make diffusion models faster, while still maintaining the quality of the generated samples. The key ideas are:

Backward Distillation: The fast (student) model is trained on inputs it produces itself while denoising, that is, on its own backward trajectory. This reduces the mismatch between what the model sees during training and what it sees when generating.
Shifted Reconstruction Loss: The training loss adapts how knowledge is transferred from the original (teacher) model depending on the current time step.
Noise Correction: At inference time, a correction is applied to address singularities in the noise prediction, which improves the quality of the generated samples.

The paper shows that this approach outperforms existing methods while generating samples in only one to three steps. With three denoising steps, it achieves results comparable to the original, computationally expensive diffusion model. This makes diffusion models much more practical for real-world applications that require fast, high-quality generation.

Technical Explanation

The proposed distillation framework consists of three key components:

Backward Distillation: To mitigate the discrepancy between training and inference, the student model is calibrated on its own backward trajectory. In other words, the noisy inputs used during training come from the student's own denoising steps, which are the kind of inputs it encounters at inference.
Shifted Reconstruction Loss: The training loss dynamically adapts the knowledge transferred from the teacher based on the current time step.
Noise Correction: An inference-time technique is introduced to enhance sample quality by addressing singularities in the noise prediction. This helps address issues that can arise when working with a low number of diffusion steps.
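The difference between conventional distillation inputs and inputs drawn from the student's own backward trajectory can be sketched as follows. This is a minimal one-dimensional illustration under stated assumptions, not the authors' implementation: the schedule, the names (`forward_input`, `backward_input`, `distill_loss`), the re-noising rule, and the squared-error loss are all hypothetical placeholders, since this summary does not give the paper's actual training objective.

```python
import math
import random
from typing import Callable, List

# (x_t, t) -> predicted clean value x0; stands in for a neural network
Denoiser = Callable[[float, int], float]

NUM_T = 1000  # hypothetical number of training time steps


def alpha_bar(t: int) -> float:
    """Assumed cosine-style signal level: 1 at t=0, ~0 at t=NUM_T."""
    return math.cos(0.5 * math.pi * t / NUM_T) ** 2


def renoise(x0: float, t: int, rng: random.Random) -> float:
    """Blend a (predicted or real) clean value with fresh noise at level t."""
    a = alpha_bar(t)
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)


def forward_input(x0_data: float, t: int, rng: random.Random) -> float:
    """Conventional input: a noised training sample. The student never sees
    inputs like this at inference, where it starts from pure noise."""
    return renoise(x0_data, t, rng)


def backward_input(student: Denoiser, schedule: List[int], index: int,
                   rng: random.Random) -> float:
    """Input from the student's own backward trajectory: start from pure noise
    at schedule[0] and run the student down to schedule[index]."""
    x = rng.gauss(0.0, 1.0)
    for i in range(index):
        x0_pred = student(x, schedule[i])          # student's current guess
        x = renoise(x0_pred, schedule[i + 1], rng)  # move to the next level
    return x


def distill_loss(student: Denoiser, teacher: Denoiser, x_t: float, t: int) -> float:
    """Placeholder loss: squared gap between student and teacher predictions."""
    return (student(x_t, t) - teacher(x_t, t)) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    student = lambda x, t: 0.5 * x   # toy stand-ins, not real models
    teacher = lambda x, t: 0.4 * x
    schedule = [1000, 666, 333]      # a three-step student schedule
    x_t = backward_input(student, schedule, 2, rng)
    print(distill_loss(student, teacher, x_t, schedule[2]))
```

In this sketch, `backward_input` needs no training images at all: the input at each step is whatever the student itself produced at the previous steps, so training and inference see the same input distribution.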

The authors evaluate their approach through extensive experiments, demonstrating that it outperforms existing acceleration methods in both quantitative metrics and human evaluations. Remarkably, the student model achieves performance comparable to the original, computationally expensive teacher model, but using only 3 denoising steps.
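To see why noise prediction can become singular, consider the standard noise-prediction parameterization used by many diffusion models. The notation below is the common textbook form and is an assumption here; this summary does not give the paper's own formulation.

```latex
% Forward process: a clean sample x_0 is mixed with Gaussian noise \epsilon
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Recovering the clean sample from a noise prediction \hat\epsilon
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\, \hat\epsilon}{\sqrt{\bar\alpha_t}}

% As t \to T, \bar\alpha_t \to 0, so an error in \hat\epsilon is amplified in \hat{x}_0 by
\frac{\sqrt{1 - \bar\alpha_t}}{\sqrt{\bar\alpha_t}} \;\to\; \infty
```

In the limiting case where the signal level at the final time step is exactly zero, the input is pure noise, so the noise is already known and any error in the network's prediction can only hurt. With one to three sampling steps, the first step sits at or near this regime, which is one plausible reading of why the abstract describes an inference-time correction; the exact mechanism is described in the paper itself.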

Critical Analysis

The paper provides a compelling approach to accelerating diffusion models while maintaining high-quality sample generation. The key innovations, such as Backward Distillation and Shifted Reconstruction Loss, are well-designed to address the challenges of working in a low-step regime.

However, the abstract does not discuss the potential limitations or caveats of this approach. For example, it is unclear how the method would perform on more complex or diverse datasets, or how sensitive the results are to hyperparameter choices. Additionally, it does not address the computational and memory cost of the distillation training itself compared with the original diffusion model.

Further research could investigate the robustness and scalability of this approach, as well as explore ways to further reduce the number of required diffusion steps without sacrificing sample quality. Comparisons to other recent acceleration methods, such as SwiftBrush or Diffusion Time Step Curriculum, could also provide additional insights.

Conclusion

This paper presents a novel distillation framework that enables high-fidelity, diverse sample generation from diffusion models using just 1-3 denoising steps. By addressing the training-inference discrepancy, dynamically adapting the loss function, and correcting noise predictions, the authors have developed a practical approach to accelerating diffusion models without compromising image quality.

If further validated and refined, this work could have significant implications for the deployment of diffusion models in real-world applications, where computational efficiency and fast generation are critical. The principles introduced here may also inspire new directions for accelerating other types of generative models beyond just diffusion.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

Clement Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin

In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models: Flash Diffusion. The method reaches state-of-the-art performances in terms of FID and CLIP-Score for few steps image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is also exposed across several tasks such as text-to-image, inpainting, face-swapping, super-resolution and using different backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\alpha$), as well as adapters. In all cases, the method allowed to reduce drastically the number of sampling steps while maintaining very high-quality image generation. The official implementation is available at https://github.com/gojasper/flash-diffusion.

6/7/2024

cs.CV cs.AI cs.LG

EM Distillation for One-step Diffusion Models

Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao

While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilizes the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.

5/28/2024

cs.LG cs.AI stat.ML

SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

Hongjian Liu, Qingsong Xie, Zhijie Deng, Chen Chen, Shixiang Tang, Fueyang Fu, Zheng-jun Zha, Haonan Lu

The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality generations can be achieved with just 1-2 sampling steps, and further improvements can be obtained by adding additional steps. In contrast to vanilla consistency distillation (CD) which distills the ordinary differential equation solvers-based sampling process of a pretrained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the sample quality with rare sampling steps. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID (Fréchet Inception Distance) of 22.1, surpassing that (23.4) of the 1-step InstaFlow (Liu et al., 2023) and matching that of 4-step UFOGen (Xue et al., 2023b). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation (Luo et al., 2023a), with up to 16% improvement in a qualified metric. The code and checkpoints are coming soon.

4/16/2024

cs.CV

Plug-and-Play Diffusion Distillation

Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte, Wei-An Lin, Hui Qu, Mingi Kwon, Ratheesh Kalarot

Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and only requires 1% trainable parameters of the base model. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this plug-and-play functionality drastically improves inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve a comparable FID score to the teacher with as few as 8 to 16 steps.

6/17/2024

cs.CV