EM Distillation for One-step Diffusion Models

2405.16852

Published 5/28/2024 by Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao

cs.LG cs.AI stat.ML

Abstract

While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.

Overview

This paper introduces "EM Distillation" (EMD), a maximum likelihood-based technique for distilling a diffusion model into a one-step generator, a type of generative model used for tasks like image synthesis.
The main idea is to transfer the knowledge of a pre-trained diffusion model (the teacher) into a generator that produces a sample in a single step, using an Expectation-Maximization (EM) algorithm.
The authors report minimal loss of perceptual quality relative to the teacher, and better FID scores than existing one-step generative methods on ImageNet-64 and ImageNet-128.

Plain English Explanation

Diffusion models are a powerful type of machine learning model that can be used to generate realistic images, audio, and other types of data. However, they can be computationally intensive and slow to use, which limits their real-world applications.

EM Distillation for One-step Diffusion Models introduces a new way to make diffusion models faster and more efficient. The key idea is to take a pre-trained diffusion model and use a technique called "Expectation-Maximization (EM) Distillation" to transfer its knowledge into a simpler, faster model.

This simpler model, called a "one-step" generator, produces a new sample in a single pass instead of many iterative steps, with minimal loss of perceptual quality according to the authors. This could make diffusion models much more practical for real-world applications like image editing, video synthesis, and medical imaging.

The paper reports that EM Distillation outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models. This is a meaningful improvement in the efficiency and usability of diffusion models.

Technical Explanation

The paper introduces a novel technique called "EM Distillation" for training one-step diffusion models. Diffusion models are a type of generative model that work by gradually adding noise to an input, then learning to reverse the process to generate new samples.
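The "gradually adding noise" step described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the names `forward_noise` and `vp_schedule` are hypothetical, and the cosine schedule is an assumed variance-preserving choice that the summary does not specify.

```python
import numpy as np

def vp_schedule(t):
    """Assumed variance-preserving schedule with alpha^2 + sigma^2 = 1."""
    alpha = np.cos(0.5 * np.pi * t)
    sigma = np.sin(0.5 * np.pi * t)
    return alpha, sigma

def forward_noise(x0, alpha_t, sigma_t, rng):
    """Sample x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return alpha_t * x0 + sigma_t * eps, eps

rng = np.random.default_rng(0)
# Toy "data": a tight cluster of 2-D points around (3, 3).
x0 = 3.0 + 0.1 * rng.standard_normal((10000, 2))

for t in (0.0, 0.5, 1.0):
    alpha, sigma = vp_schedule(t)
    xt, _ = forward_noise(x0, alpha, sigma, rng)
    # As t grows, the data signal fades and the sample approaches pure noise.
    print(f"t={t:.1f}  mean={xt.mean():.3f}  std={xt.std():.3f}")
```

At t = 0 the sample is the clean data, and at t = 1 it is essentially standard Gaussian noise. A diffusion model is trained to undo this corruption step by step, which is why sampling from it is iterative.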

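Prior work such as "Distilling Diffusion Models into Conditional GANs" and "Improved distribution matching distillation for fast image synthesis" has explored ways to distill diffusion models into faster generators. According to the abstract, existing distillation methods have notable limitations: quality can degrade when very few sampling steps are used, some methods need access to training data, and some rely on mode-seeking optimization that may fail to capture the full distribution.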
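The abstract contrasts EMD's maximum likelihood objective with mode-seeking KL minimization. The toy example below illustrates that difference on a one-dimensional grid. It is an illustration only, not taken from the paper: the bimodal "teacher", the single-Gaussian "student" family, and all function names are assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def discretize(density, x):
    """Turn density values on a grid into a normalized probability vector."""
    w = density(x)
    return w / w.sum()

def kl(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

x = np.linspace(-8.0, 8.0, 1601)
# Teacher: two equally weighted modes at -2 and +2.
teacher = discretize(
    lambda v: 0.5 * normal_pdf(v, -2.0, 0.5) + 0.5 * normal_pdf(v, 2.0, 0.5), x
)

# Student family: one narrow Gaussian whose mean is the only free parameter.
mus = np.linspace(-4.0, 4.0, 81)
students = [discretize(lambda v, m=m: normal_pdf(v, m, 0.5), x) for m in mus]

forward = np.array([kl(teacher, q) for q in students])  # KL(teacher || student)
reverse = np.array([kl(q, teacher) for q in students])  # KL(student || teacher)

mu_forward = mus[np.argmin(forward)]
mu_reverse = mus[np.argmin(reverse)]
print(f"forward-KL (maximum likelihood) optimum: mu = {mu_forward:+.1f}")
print(f"reverse-KL (mode-seeking) optimum:       mu = {mu_reverse:+.1f}")
```

With a student too narrow to cover both modes, the forward KL places it between them so that neither mode is ignored, while the reverse KL locks onto a single mode and drops the other. A one-step generator is far more expressive than this one-parameter student, so the example shows only the direction of the bias, not its size in practice.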

EM Distillation addresses these issues by using an Expectation-Maximization (EM) algorithm to transfer the knowledge from a pre-trained diffusion model into a simpler one-step model. The one-step model is trained to directly map a noise vector to a sample, without the need for the multi-step diffusion process.
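The EM view can be made concrete with a standard latent-variable identity. The notation is assumed for illustration and may differ from the paper's: $q(x)$ is the teacher distribution, $z \sim p(z)$ is the generator's input noise, and $p_\theta(x) = \int p(z)\, p_\theta(x \mid z)\, dz$ is the distribution of the one-step generator. Maximum likelihood training minimizes the forward KL divergence, whose gradient is:

```latex
\nabla_\theta \, \mathrm{KL}\big(q(x) \,\|\, p_\theta(x)\big)
  = -\,\mathbb{E}_{q(x)}\big[\nabla_\theta \log p_\theta(x)\big]
  = -\,\mathbb{E}_{q(x)\, p_\theta(z \mid x)}\big[\nabla_\theta \log p_\theta(x, z)\big]
```

The second equality uses $\nabla_\theta \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}[\nabla_\theta \log p_\theta(x, z)]$. Read this way, the E-step draws pairs $(x, z)$ from the joint of the teacher and the inferred generator latents, and the M-step updates $\theta$ on those pairs, which matches the abstract's description. How the paper samples from this joint and handles noise levels is not covered by this sketch.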

According to the abstract, EMD distills the teacher into a one-step generator with minimal loss of perceptual quality, and it outperforms existing one-step generative methods in FID on ImageNet-64 and ImageNet-128. Because the generator needs a single network evaluation per sample rather than an iterative sampling loop, sampling is far cheaper than running the teacher's iterative process.
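To illustrate where the sampling cost goes, the toy sketch below counts network evaluations for an iterative sampler and for a one-step generator. Everything here is a stand-in: `CountingNet`, the update rule in `multistep_sample`, and the two lambda "networks" are hypothetical and are not the paper's sampler or models.

```python
import numpy as np

class CountingNet:
    """Stand-in for a neural network that counts how often it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)

def multistep_sample(denoiser, z, num_steps):
    """Generic iterative sampler: one network call per step (toy update rule)."""
    x = z
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_hat = denoiser(x, t_cur)
        # Move part of the way toward the current clean-sample estimate.
        x = x0_hat + (t_next / t_cur) * (x - x0_hat)
    return x

def one_step_sample(generator, z):
    """One-step generation: a single network call maps noise to a sample."""
    return generator(z)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 2))

teacher = CountingNet(lambda x, t: 0.5 * x)  # toy denoiser, not a trained model
student = CountingNet(lambda z: np.tanh(z))  # toy generator, not a trained model

x_teacher = multistep_sample(teacher, z, num_steps=64)
x_student = one_step_sample(student, z)
print("teacher network calls:", teacher.calls)
print("student network calls:", student.calls)
```

The number of steps (64 here) is arbitrary. For networks of similar size, the number of calls is what dominates sampling time.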

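The paper also develops a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process, and it draws a connection between EMD and existing methods that minimize a mode-seeking KL divergence. Related efforts to accelerate diffusion sampling include "SFDDM: Single-fold Distillation for Diffusion models" and "Imagine Flash: Accelerating EMU Diffusion Models Backward"; these are separate works, not contributions of this paper.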

Overall, the EM Distillation approach represents a significant advance in making diffusion models more practical and accessible for real-world applications.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the EM Distillation technique, demonstrating its effectiveness on a range of image synthesis tasks. The authors provide a clear and detailed explanation of the method, as well as comprehensive experimental results.

One potential limitation of the approach is that it relies on having access to a pre-trained diffusion model, which may not always be available. The authors do not explore how EM Distillation would perform in the case of training a one-step model from scratch, without the benefit of a pre-trained teacher model.

Additionally, the paper does not address related work on learning diffusion priors from observations by expectation maximization, which might offer further improvements in the efficiency and performance of one-step diffusion models.

Overall, the EM Distillation technique presented in this paper represents an important step forward in making diffusion models more practical and widely applicable. The authors have clearly demonstrated the potential of this approach, and it will be interesting to see how it is further developed and applied in the future.

Conclusion

The EM Distillation technique introduced in this paper offers a promising way to make diffusion models more efficient and practical for real-world applications. By distilling a pre-trained diffusion model into a one-step generator, the authors show that sampling can be reduced to a single network evaluation with minimal loss of perceptual quality.

This work could broaden the use of diffusion models in a wide range of domains, from image and video synthesis to medical imaging. By making these generative models cheaper to sample from, EM Distillation could enable applications where iterative sampling was previously too slow.

As the field of machine learning continues to advance, techniques like EM Distillation will play an increasingly important role in bridging the gap between cutting-edge research and real-world deployment. This paper serves as an inspiring example of how innovative algorithmic approaches can help overcome the challenges of deploying complex models in practical settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Diffusion Models Are Innate One-Step Generators

Bowen Zheng, Tianming Yang

Diffusion Models (DMs) have achieved great success in image generation and other fields. By fine sampling through the trajectory defined by the SDE/ODE solver based on a well-trained score model, DMs can generate remarkable high-quality results. However, this precise sampling often requires multiple steps and is computationally demanding. To address this problem, instance-based distillation methods have been proposed to distill a one-step generator from a DM by having a simpler student model mimic a more complex teacher model. Yet, our research reveals an inherent limitation in these methods: the teacher model, with more steps and more parameters, occupies different local minima compared to the student model, leading to suboptimal performance when the student model attempts to replicate the teacher. To avoid this problem, we introduce a novel distributional distillation method, which uses an exclusive distributional loss. This method exceeds state-of-the-art (SOTA) results while requiring significantly fewer training images. Additionally, we show that DMs' layers are differentially activated at different time steps, leading to an inherent capability to generate images in a single step. Freezing most of the convolutional layers in a DM during distributional distillation enables this innate capability and leads to further performance improvements. Our method achieves the SOTA results on CIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency. Most of those results are obtained with only 5 million training images within 6 hours on 8 A100 GPUs.

6/10/2024

cs.CV

Multistep Distillation of Diffusion Models via Moment Matching

Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom

We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the Imagenet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers.

6/7/2024

cs.LG cs.AI cs.CV cs.NE

Distilling Diffusion Models into Conditional GANs

Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park

We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.

6/17/2024

cs.CV cs.GR cs.LG

SFDDM: Single-fold Distillation for Diffusion models

Chi Hong, Jiyue Huang, Robert Birke, Dick Epema, Stefanie Roos, Lydia Y. Chen

While diffusion models effectively generate remarkable synthetic images, a key limitation is the inference inefficiency, requiring numerous sampling steps. To accelerate inference and maintain high-quality synthesis, teacher-student distillation is applied to compress the diffusion models in a progressive and binary manner by retraining, e.g., reducing the 1024-step model to a 128-step model in 3 folds. In this paper, we propose a single-fold distillation algorithm, SFDDM, which can flexibly compress the teacher diffusion model into a student model of any desired step, based on reparameterization of the intermediate inputs from the teacher model. To train the student diffusion, we minimize not only the output distance but also the distribution of the hidden variables between the teacher and student model. Extensive experiments on four datasets demonstrate that our student model trained by the proposed SFDDM is able to sample high-quality data with steps reduced to as little as approximately 1%, thus, trading off inference time. Our remarkable performance highlights that SFDDM effectively transfers knowledge in single-fold distillation, achieving semantic consistency and meaningful image interpolation.

5/27/2024

cs.CV cs.LG