Multistep Distillation of Diffusion Models via Moment Matching

Read original: arXiv:2406.04103 - Published 6/7/2024 by Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom

Overview

This paper introduces "Multistep Distillation of Diffusion Models via Moment Matching", a method for making diffusion models faster to sample.
The key idea is to distill a many-step diffusion model into a few-step generator by matching conditional expectations of the clean data given noisy data along the sampling trajectory.
With up to 8 sampling steps, the distilled models are reported to outperform both their one-step versions and their original many-step teachers on ImageNet, which could make diffusion models more practical in real-world applications.

Plain English Explanation

Diffusion models are a powerful class of machine learning models that can generate highly realistic images. However, they can be computationally expensive and slow to generate new images. The authors of this paper propose a new technique to address this issue.

The core idea is to "distill" the knowledge from a pre-trained, many-step diffusion model into a generator that needs only a few sampling steps. The generator is trained so that, at each noise level along the sampling trajectory, the expected clean image given a noisy image agrees with the teacher's. This builds on prior work on distilling diffusion models, in particular the recently proposed one-step methods that the paper extends.

By matching these conditional expectations, the generator can learn to produce high-quality images in far fewer steps than the original diffusion model. This is similar in spirit to one-step approaches such as "EM Distillation", but applied to the multistep setting.

The key advantage of this approach is that it allows diffusion models to be used more practically in real-world applications that require fast image generation, without sacrificing too much in terms of image quality. This connects to other work on improving the efficiency of diffusion models, such as "Flash Diffusion" and "Imagine-FLASH".
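
To make "conditional expectation of the clean data given noisy data" concrete, here is a minimal toy sketch. It assumes one-dimensional Gaussian data and a noisy observation z = alpha * x + sigma * eps. The function names and all parameter values are illustrative and do not come from the paper. In this toy case the conditional expectation has a closed form, which the sketch compares against a Monte Carlo estimate.

```python
import numpy as np

def posterior_mean_gaussian(z, mu, s, alpha, sigma):
    """Closed-form E[x | z] when x ~ N(mu, s^2) and z = alpha * x + sigma * eps."""
    gain = alpha * s**2 / (alpha**2 * s**2 + sigma**2)
    return mu + gain * (z - alpha * mu)

def empirical_posterior_mean(z0, mu, s, alpha, sigma,
                             n=200_000, half_width=0.05, seed=0):
    """Monte Carlo estimate: average the clean x over draws whose noisy z lands near z0."""
    rng = np.random.default_rng(seed)
    x = mu + s * rng.standard_normal(n)              # clean data
    z = alpha * x + sigma * rng.standard_normal(n)   # noisy data
    near = np.abs(z - z0) < half_width               # keep draws with z close to z0
    return x[near].mean()

# Illustrative values only (assumptions, not settings from the paper).
exact = posterior_mean_gaussian(1.0, mu=2.0, s=1.0, alpha=0.8, sigma=0.6)
approx = empirical_posterior_mean(1.0, mu=2.0, s=1.0, alpha=0.8, sigma=0.6)
```

A denoising network trained with a squared-error loss approximates this same quantity for real images, where no closed form exists.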

Technical Explanation

The authors propose a multistep distillation technique that trains a few-step generator G from a pre-trained many-step diffusion model F (the teacher). Training matches conditional expectations of the clean data given noisy data at multiple timesteps along the sampling trajectory.
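
As a rough illustration of what few-step sampling along a trajectory can look like, the sketch below alternates between a generator call that predicts clean data and a draw of a less noisy state from the Gaussian posterior. It is a generic ancestral sampler under assumptions, not the paper's exact procedure: the cosine schedule, the function names, and `toy_generator` are all hypothetical stand-ins.

```python
import numpy as np

def alpha_sigma(t):
    """Cosine variance-preserving schedule (an assumption): z_t = alpha_t * x + sigma_t * eps."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def posterior_sample(z_t, x_hat, t, s, rng):
    """Draw z_s ~ q(z_s | z_t, x = x_hat) for s < t under the Gaussian forward process."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    a_ts = a_t / a_s                         # alpha_{t|s}
    var_ts = s_t**2 - a_ts**2 * s_s**2       # sigma^2_{t|s}
    mean = (a_ts * s_s**2 / s_t**2) * z_t + (a_s * var_ts / s_t**2) * x_hat
    var = max(var_ts * s_s**2 / s_t**2, 0.0)  # guard against tiny negative round-off
    return mean + np.sqrt(var) * rng.standard_normal(z_t.shape)

def few_step_sample(generator, shape, num_steps=8, seed=0):
    """Run num_steps generator calls, moving from pure noise (t=1) to data (t=0)."""
    rng = np.random.default_rng(seed)
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = rng.standard_normal(shape)           # at t=1 the state is (almost) pure noise
    for t, s in zip(times[:-1], times[1:]):
        x_hat = generator(z, t)              # predicted clean data at noise level t
        z = posterior_sample(z, x_hat, t, s, rng)
    return z

def toy_generator(z, t):
    """Hypothetical stand-in for a trained generator: always predicts 0.5."""
    return np.full_like(z, 0.5)

sample = few_step_sample(toy_generator, shape=(4, 3), num_steps=8)
```

In this sketch the cost of sampling is `num_steps` generator calls, so going from hundreds of teacher steps to 8 or fewer is where the speedup comes from.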

Specifically, the method is framed as moment matching: for noisy data at each timestep, the expected clean data under the generator's samples should equal the expected clean data under the teacher. The paper presents this as an extension of recently proposed one-step distillation methods to the multistep case, and as a new way of interpreting those methods.
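
In symbols, the first-moment condition described in the paper's abstract can be sketched as follows. The notation is assumed here for illustration and is not taken from this summary: x is clean data, z_t is noisy data at noise level t, and x̃ is a sample produced by the generator.

```latex
z_t = \alpha_t x + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

\mathbb{E}_{\text{generator}}\left[\tilde{x} \mid z_t\right]
  = \mathbb{E}_{\text{teacher}}\left[x \mid z_t\right]
  \quad \text{for every noise level } t \text{ on the sampling trajectory}
```

A diffusion model's denoising prediction is an estimate of the right-hand side, which is why a pre-trained teacher can supply the target moments.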

The authors evaluate the approach on ImageNet and on a large text-to-image model. With up to 8 sampling steps, the distilled models outperform both their one-step versions and their original many-step teachers, setting new state-of-the-art results on ImageNet while needing far fewer sampling steps than the teacher.

Critical Analysis

The authors provide a thorough experimental evaluation of their proposed method, considering different distillation strategies, architecture choices, and benchmarks. They also discuss several limitations and potential extensions of their work.

One important caveat is that the distillation process can be sensitive to hyperparameter choices and the specific architecture of the generator model. The authors note that in some cases, their approach may not outperform more direct diffusion sampling, depending on the computational budget and hardware available.

Another limitation is that the paper does not provide a deep theoretical analysis of why the multistep distillation approach works well. While the intuition behind matching conditional expectations along the sampling trajectory is clear, a more rigorous understanding of the optimization landscape and convergence properties could further strengthen the contribution.

Additionally, the abstract reports only "promising results" for text-to-image generation, where a large model generates high-resolution images directly in image space without autoencoders or upsamplers; the headline quantitative claims are for ImageNet. A more thorough evaluation on text-conditional generation, which is arguably more relevant for real-world applications, could be an interesting direction for future research.

Conclusion

This paper introduces "Multistep Distillation of Diffusion Models via Moment Matching", a technique for accelerating sampling from diffusion models. By training a few-step generator to match the teacher's conditional expectations of clean data given noisy data, the authors obtain distilled models that, with up to 8 sampling steps, outperform their many-step teachers on ImageNet.

The key contribution of this work is the ability to leverage the powerful generative capabilities of diffusion models in more practical real-world applications that require fast image generation, such as interactive applications or large-scale content creation. The multistep distillation approach builds upon and extends prior research on diffusion model distillation, offering a promising direction for improving the efficiency and usability of these state-of-the-art generative models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Multistep Distillation of Diffusion Models via Moment Matching

Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom

We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the Imagenet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers.

6/7/2024

Distilling Diffusion Models into Conditional GANs

Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park

We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.

7/19/2024

EM Distillation for One-step Diffusion Models

Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao

While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilizes the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.

5/28/2024

MLCM: Multistep Consistency Distillation of Latent Diffusion Model

Qingsong Xie, Zhenyi Liao, Chen chen, Zhijie Deng, Shixiang Tang, Haonan Lu

Distilling large latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face a dilemma where they either (i) depend on multiple individual distilled models for different sampling budgets, or (ii) sacrifice generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8) sampling steps. To address these, we extend the recent multistep consistency distillation (MCD) strategy to representative LDMs, establishing the Multistep Latent Consistency Models (MLCMs) approach for low-cost high-quality image synthesis. MLCM serves as a unified model for various sampling steps due to the promise of MCD. We further augment MCD with a progressive training strategy to strengthen inter-segment consistency to boost the quality of few-step generations. We take the states from the sampling trajectories of the teacher model as training data for MLCMs to lift the requirements for high-quality training datasets and to bridge the gap between the training and inference of the distilled model. MLCM is compatible with preference learning strategies for further improvement of visual quality and aesthetic appeal. Empirically, MLCM can generate high-quality, delightful images with only 2-8 sampling steps. On the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL gets a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 with only 4 steps, substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and 8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation.

6/13/2024