Diffusion Models Are Innate One-Step Generators

2405.20750

Published 6/10/2024 by Bowen Zheng, Tianming Yang

Abstract

Diffusion Models (DMs) have achieved great success in image generation and other fields. By fine sampling through the trajectory defined by the SDE/ODE solver based on a well-trained score model, DMs can generate remarkably high-quality results. However, this precise sampling often requires multiple steps and is computationally demanding. To address this problem, instance-based distillation methods have been proposed to distill a one-step generator from a DM by having a simpler student model mimic a more complex teacher model. Yet, our research reveals an inherent limitation in these methods: the teacher model, with more steps and more parameters, occupies different local minima compared to the student model, leading to suboptimal performance when the student model attempts to replicate the teacher. To avoid this problem, we introduce a novel distributional distillation method, which uses an exclusive distributional loss. This method exceeds state-of-the-art (SOTA) results while requiring significantly fewer training images. Additionally, we show that DMs' layers are differentially activated at different time steps, leading to an inherent capability to generate images in a single step. Freezing most of the convolutional layers in a DM during distributional distillation enables this innate capability and leads to further performance improvements. Our method achieves SOTA results on CIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency. Most of those results are obtained with only 5 million training images within 6 hours on 8 A100 GPUs.

Overview

The paper argues that diffusion models are innate one-step generators: their layers are activated differently at different time steps, which gives them a built-in ability to produce an image in a single step.
It identifies a limitation of instance-based distillation, in which a student imitates the teacher's individual outputs, and proposes a distributional distillation method that relies exclusively on a distributional loss.
Freezing most convolutional layers during distillation improves results further, with reported FIDs of 1.54 on CIFAR-10, 1.23 on AFHQv2 64x64, 0.85 on FFHQ 64x64, and 1.16 on ImageNet 64x64.

Plain English Explanation

Diffusion models are a type of machine learning model that can be used to generate new images. They work by gradually adding noise to an image, then learning to reverse that process to generate new images.
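The "adding noise" half of that description can be sketched in a few lines. This is only an illustration under assumptions that are not in the paper summary: the toy four-pixel image, the geometric noise schedule, and the names `add_noise` and `noise_schedule` are all hypothetical, and each noisy copy is drawn directly from the clean image rather than by chaining small steps.

```python
import random

def add_noise(pixels, sigma, rng):
    """Return a noisy copy of `pixels` with Gaussian noise of std `sigma`."""
    return [p + sigma * rng.gauss(0.0, 1.0) for p in pixels]

def noise_schedule(num_steps, sigma_min=0.01, sigma_max=10.0):
    """Geometrically spaced noise levels from sigma_min up to sigma_max.

    The range and spacing are illustrative assumptions, not the paper's.
    """
    if num_steps == 1:
        return [sigma_max]
    ratio = (sigma_max / sigma_min) ** (1.0 / (num_steps - 1))
    return [sigma_min * ratio ** i for i in range(num_steps)]

rng = random.Random(0)
image = [0.2, -0.5, 0.9, 0.0]   # toy "image" with four pixel values
sigmas = noise_schedule(5)

# One noisy version per noise level; at the largest sigma the original
# pixel values are almost entirely drowned out by noise.
noisy_versions = [add_noise(image, s, rng) for s in sigmas]
```

A trained diffusion model learns to undo this corruption at every noise level; standard sampling then walks back from the noisiest level in many small steps, which is the cost that one-step generation tries to avoid.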

This paper looks at diffusion models differently. Rather than treating them as inherently slow, multi-step processes, the researchers argue that diffusion models are innate one-step generators: different layers of the network are active at different time steps, so the ability to produce an image in a single step is already present in the trained model and can be brought out by distillation that leaves most convolutional layers frozen.

To unlock this ability, the paper proposes a distributional distillation method. Instead of forcing a one-step student to copy each individual output of the multi-step teacher, the student only has to match the overall distribution of images. According to the abstract, this beats previous state-of-the-art results while using far fewer training images, which challenges the assumption that high-quality diffusion sampling must be slow.

Technical Explanation

The paper Diffusion Models Are Innate One-Step Generators proposes a new perspective on diffusion models, showing that they can be viewed as innate one-step generators rather than slow, multi-step processes.

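The researchers first identify a limitation of instance-based distillation, in which a one-step student is trained to reproduce the outputs of a multi-step teacher. Because the teacher, with more steps and more parameters, occupies different local minima from the student, imitating it sample by sample yields suboptimal results. Their distributional distillation method instead trains the student with an exclusive distributional loss.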
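The abstract's contrast between instance-based and distributional distillation can be made concrete with a toy example. The sketch below is not the paper's loss: the abstract does not specify the form of its distributional loss, so a 1-D Wasserstein-1 distance between sample sets is used here purely as a stand-in, and the function names and numbers are hypothetical.

```python
def instance_loss(student_out, teacher_out):
    """Mean squared error between outputs paired by shared input noise."""
    n = len(teacher_out)
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / n

def distribution_gap(student_out, teacher_out):
    """1-D Wasserstein-1 distance between two equal-size sample sets.

    Stand-in for a distributional loss: it ignores which input noise
    produced which output and compares only the sets of samples.
    """
    assert len(student_out) == len(teacher_out)
    pairs = zip(sorted(student_out), sorted(teacher_out))
    return sum(abs(s - t) for s, t in pairs) / len(teacher_out)

# Outputs for the same four input noises (toy scalars instead of images).
teacher = [-2.0, -1.0, 1.0, 2.0]
# The student produces the same set of values, assigned to different inputs.
student = [2.0, 1.0, -1.0, -2.0]

print(instance_loss(student, teacher))     # 10.0
print(distribution_gap(student, teacher))  # 0.0
```

The student here generates exactly the teacher's distribution, yet an instance-based objective penalizes it heavily because its noise-to-output mapping differs. A distributional objective leaves the student free to choose its own mapping, which is the freedom the paper argues a one-step model needs.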

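The paper also shows that a diffusion model's layers are differentially activated at different time steps. Freezing most of the convolutional layers during distributional distillation exploits this and improves performance further. The reported one-step results are FID 1.54 on CIFAR-10, 1.23 on AFHQv2 64x64, 0.85 on FFHQ 64x64, and 1.16 on ImageNet 64x64, most of them obtained with only 5 million training images within 6 hours on 8 A100 GPUs.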
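The abstract reports that freezing most convolutional layers during distillation improves results. A minimal sketch of that bookkeeping follows, assuming layers can be identified by parameter name. The parameter names and the `split_trainable` helper are hypothetical, and the sketch freezes every parameter whose name contains "conv", whereas the paper freezes most but not all such layers (which ones stay trainable is not stated in this summary).

```python
def split_trainable(param_names, frozen_keyword="conv"):
    """Partition parameter names into (trainable, frozen) lists by keyword."""
    trainable, frozen = [], []
    for name in param_names:
        if frozen_keyword in name:
            frozen.append(name)      # excluded from the optimizer
        else:
            trainable.append(name)   # updated during distillation
    return trainable, frozen

# Hypothetical parameter names for a small U-Net-style network.
param_names = [
    "enc.block0.conv0.weight",
    "enc.block0.conv1.weight",
    "enc.block0.norm0.weight",
    "enc.block0.attn.qkv.weight",
    "dec.block0.conv0.weight",
    "dec.block0.norm0.weight",
]

trainable, frozen = split_trainable(param_names)
print(f"{len(frozen)} frozen, {len(trainable)} trainable")
```

In a PyTorch training loop the same split would typically be applied by setting `requires_grad = False` on the frozen parameters and passing only the trainable ones to the optimizer.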

Critical Analysis

The paper makes a compelling case that diffusion models can be viewed as innate one-step generators, challenging the prevailing assumption that they require slow, multi-step sampling. Distributional distillation with most convolutional layers frozen is a promising way to improve the efficiency of diffusion models, and the reported training budget is small.

However, some questions remain open. The reported benchmarks are all at 64x64 resolution or below, so it is unclear how the one-step generation capability would scale to higher-resolution or more complex image domains. There may also be trade-offs between generation speed and other properties, such as sample quality or mode coverage, that FID alone does not capture.

Additionally, the paper focuses on image generation, but diffusion models have been applied to a variety of other domains like text, audio, and molecular modeling. It's unclear if the one-step generator perspective would translate equally well to these other applications.

Overall, the paper presents an intriguing new way of thinking about diffusion models, but further research is needed to fully understand the implications and limitations of this approach.

Conclusion

This paper challenges the common perception of diffusion models as inherently slow and multi-step, proposing a new perspective that views them as innate one-step generators. Its distributional distillation method, combined with freezing most convolutional layers, demonstrates the potential for diffusion models to achieve fast, efficient image synthesis.

These findings matter for the broader line of work on distilling diffusion models into one-step generators. By rethinking the fundamental nature of diffusion models, the paper opens up new avenues for research and innovation in the field of generative modeling.

While further work is needed to fully explore the limitations and broader applicability of this perspective, the paper's central insight – that diffusion models can be viewed as one-step generators – represents a significant advancement in our understanding of this powerful class of machine learning models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SFDDM: Single-fold Distillation for Diffusion models

Chi Hong, Jiyue Huang, Robert Birke, Dick Epema, Stefanie Roos, Lydia Y. Chen

While diffusion models effectively generate remarkable synthetic images, a key limitation is the inference inefficiency, requiring numerous sampling steps. To accelerate inference and maintain high-quality synthesis, teacher-student distillation is applied to compress the diffusion models in a progressive and binary manner by retraining, e.g., reducing the 1024-step model to a 128-step model in 3 folds. In this paper, we propose a single-fold distillation algorithm, SFDDM, which can flexibly compress the teacher diffusion model into a student model of any desired step, based on reparameterization of the intermediate inputs from the teacher model. To train the student diffusion, we minimize not only the output distance but also the distribution of the hidden variables between the teacher and student model. Extensive experiments on four datasets demonstrate that our student model trained by the proposed SFDDM is able to sample high-quality data with steps reduced to as little as approximately 1%, thus, trading off inference time. Our remarkable performance highlights that SFDDM effectively transfers knowledge in single-fold distillation, achieving semantic consistency and meaningful image interpolation.

5/27/2024

cs.CV cs.LG

Improved Distribution Matching Distillation for Fast Image Synthesis

Tianwei Yin, Michael Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman

Recent approaches have shown promise in distilling diffusion models into efficient one-step generators. Among them, Distribution Matching Distillation (DMD) produces one-step generators that match their teacher in distribution, without enforcing a one-to-one correspondence with the sampling trajectories of their teachers. However, to ensure stable training, DMD requires an additional regression loss computed using a large set of noise-image pairs generated by the teacher with many steps of a deterministic sampler. This is costly for large-scale text-to-image synthesis and limits the student's quality, tying it too closely to the teacher's original sampling paths. We introduce DMD2, a set of techniques that lift this limitation and improve DMD training. First, we eliminate the regression loss and the need for expensive dataset construction. We show that the resulting instability is due to the fake critic not estimating the distribution of generated samples accurately and propose a two time-scale update rule as a remedy. Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images. This lets us train the student model on real data, mitigating the imperfect real score estimation from the teacher model, and enhancing quality. Lastly, we modify the training procedure to enable multi-step sampling. We identify and address the training-inference input mismatch problem in this setting, by simulating inference-time generator samples during training time. Taken together, our improvements set new benchmarks in one-step image generation, with FID scores of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the original teacher despite a 500X reduction in inference cost. Further, we show our approach can generate megapixel images by distilling SDXL, demonstrating exceptional visual quality among few-step methods.

5/27/2024

cs.CV

Directly Denoising Diffusion Model

Dan Zhang, Jingjing Wang, Feng Luo

In this paper, we present the Directly Denoising Diffusion Model (DDDM): a simple and generic approach for generating realistic images with few-step sampling, while multistep sampling is still preserved for better performance. DDDMs require no delicately designed samplers nor distillation on pre-trained distillation models. DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own. To generate images, samples generated from the previous time step are also taken into consideration, guiding the generation process iteratively. We further propose Pseudo-LPIPS, a novel metric loss that is more robust to various values of hyperparameter. Despite its simplicity, the proposed approach can achieve strong performance in benchmark datasets. Our model achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing those obtained from GANs and distillation-based models. By extending the sampling to 1000 steps, we further reduce FID score to 1.79, aligning with state-of-the-art methods in the literature. For ImageNet 64x64, our approach stands as a competitive contender against leading models.

6/3/2024

cs.CV

EM Distillation for One-step Diffusion Models

Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying Nian Wu, Kevin Patrick Murphy, Tim Salimans, Ben Poole, Ruiqi Gao

While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilizes the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.

5/28/2024

cs.LG cs.AI stat.ML