Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

2406.06890

Published 6/12/2024 by Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang

cs.CV

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

Abstract

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data. We propose motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM includes a video consistency model that distills motion from the video teacher model, and an image discriminator that enhances frame appearance to match high-quality image data. This combination presents two challenges: (1) conflicting frame learning objectives, as video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and (2) training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data.

Create account to get full access

Overview

This paper introduces the Motion Consistency Model, a novel approach to accelerate video diffusion by disentangling motion and appearance during the diffusion process.
The model leverages a motion-appearance distillation technique to transfer knowledge from a pre-trained optical flow model, improving the quality and consistency of generated video frames.
The paper also explores various strategies for incorporating motion consistency, including multi-step distillation and stochastic consistency distillation, drawing insights from related work in MLCM, Scott, and MotionMaster.

Plain English Explanation

The paper describes a new way to generate high-quality videos using a technique called "diffusion." Diffusion models work by adding noise to an image or video, then gradually removing that noise to create a new, realistic-looking image or video.

The key insight of this paper is that by separating the motion and appearance of the video during the diffusion process, the model can generate videos that are more consistent and realistic. The authors use a technique called "motion-appearance distillation" to transfer knowledge from a pre-trained optical flow model, which helps the diffusion model better understand how objects should move in the video.

The paper also explores different strategies for incorporating this motion consistency, such as multi-step distillation and stochastic consistency distillation. These approaches aim to further improve the quality and coherence of the generated videos.

The work builds on insights from related research, such as MotionMaster, which focuses on transferring camera motion, and Dancing Still Images to Video, which explores using static images to generate video.

Technical Explanation

The Motion Consistency Model uses a diffusion-based approach to generate high-quality videos. Diffusion models work by gradually adding noise to an image or video, then removing that noise to create a new, realistic-looking output.

The key innovation of this paper is the use of a motion-appearance distillation technique to separate the motion and appearance components of the video during the diffusion process. This is achieved by training the diffusion model to mimic the behavior of a pre-trained optical flow model, which can capture the movement of objects in the video.

The authors explore several strategies for incorporating this motion consistency, including:

Multi-step distillation: Applying the motion-appearance distillation at multiple steps during the diffusion process, as described in MLCM.
Stochastic consistency distillation: Introducing stochasticity into the consistency distillation, as proposed in Scott.
Incorporating motion transfer: Leveraging techniques from MotionMaster to transfer camera motion to the generated videos.
Combining with static-to-video distillation: Exploring the synergies between motion-appearance distillation and the static-to-video distillation approach used in Dancing Still Images to Video.

The authors conduct extensive experiments to evaluate the performance of the Motion Consistency Model, comparing it to state-of-the-art video diffusion approaches and demonstrating its ability to generate high-quality, consistent videos.

Critical Analysis

The Motion Consistency Model represents a significant advancement in the field of video diffusion, addressing some of the key challenges of generating coherent and realistic videos. By disentangling motion and appearance, the model is able to better capture the dynamics of the video, leading to improved quality and consistency.

However, the paper does not discuss the potential limitations of this approach, such as its scalability to longer or more complex videos, or its ability to handle diverse video content. Additionally, the paper does not explore the potential biases or ethical considerations that may arise from the use of such a powerful video generation system.

Further research could investigate the generalization capabilities of the Motion Consistency Model, exploring its performance on a wider range of video datasets and use cases. Incorporating StoryDiffusion's long-range self-attention mechanisms could also help improve the model's ability to capture complex video narratives.

Overall, the Motion Consistency Model represents an exciting development in the field of video diffusion, with the potential to enable more realistic and coherent video generation. However, it is important to continue exploring the model's limitations and potential ethical implications to ensure its responsible and beneficial deployment.

Conclusion

The Motion Consistency Model introduced in this paper represents a significant advancement in the field of video diffusion, addressing the challenge of generating high-quality, consistent videos. By disentangling motion and appearance during the diffusion process and leveraging a motion-appearance distillation technique, the model is able to produce videos with improved quality and coherence.

The authors explore various strategies for incorporating motion consistency, drawing insights from related work in areas such as multi-step distillation, stochastic consistency distillation, and motion transfer. These approaches aim to further enhance the model's ability to capture the dynamics of video content.

While the Motion Consistency Model shows promising results, future research should explore its scalability, generalization capabilities, and potential ethical considerations. Incorporating insights from related work, such as StoryDiffusion's long-range self-attention mechanisms, could also help to expand the model's capabilities in generating more complex and narratively-coherent videos.

Overall, the Motion Consistency Model represents an important step forward in the field of video diffusion, with the potential to enable more realistic and engaging video generation for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MLCM: Multistep Consistency Distillation of Latent Diffusion Model

Qingsong Xie, Zhenyi Liao, Chen chen, Zhijie Deng, Shixiang Tang, Haonan Lu

Distilling large latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face a dilemma where they either (i) depend on multiple individual distilled models for different sampling budgets, or (ii) sacrifice generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8) sampling steps. To address these, we extend the recent multistep consistency distillation (MCD) strategy to representative LDMs, establishing the Multistep Latent Consistency Models (MLCMs) approach for low-cost high-quality image synthesis. MLCM serves as a unified model for various sampling steps due to the promise of MCD. We further augment MCD with a progressive training strategy to strengthen inter-segment consistency to boost the quality of few-step generations. We take the states from the sampling trajectories of the teacher model as training data for MLCMs to lift the requirements for high-quality training datasets and to bridge the gap between the training and inference of the distilled model. MLCM is compatible with preference learning strategies for further improvement of visual quality and aesthetic appeal. Empirically, MLCM can generate high-quality, delightful images with only 2-8 sampling steps. On the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL gets a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 with only 4 steps, substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and 8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation.

6/13/2024

cs.CV cs.AI

SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

Hongjian Liu, Qingsong Xie, Zhijie Deng, Chen Chen, Shixiang Tang, Fueyang Fu, Zheng-jun Zha, Haonan Lu

The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality generations can be achieved with just 1-2 sampling steps, and further improvements can be obtained by adding additional steps. In contrast to vanilla consistency distillation (CD) which distills the ordinary differential equation solvers-based sampling process of a pretrained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the sample quality with rare sampling steps. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID (Frechet Inceptio Distance) of 22.1, surpassing that (23.4) of the 1-step InstaFlow (Liu et al., 2023) and matching that of 4-step UFOGen (Xue et al., 2023b). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation (Luo et al., 2023a), with up to 16% improvement in a qualified metric. The code and checkpoints are coming soon.

4/16/2024

cs.CV

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

Nikita Starodubcev, Mikhail Khoroshikh, Artem Babenko, Dmitry Baranchuk

Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.

6/27/2024

cs.CV

❗

Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement

Ziyu Wang, Yue Xu, Cewu Lu, Yong-Lu Li

Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, the distillation for videos, characterized by an exclusive temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation, and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales, with a notably smaller memory storage budget. Our code is available at https://github.com/yuz1wan/video_distillation.

4/16/2024

cs.CV cs.LG