FIFO-Diffusion: Generating Infinite Videos from Text without Training

2405.11473

Published 6/13/2024 by Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han


Abstract

We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner ones by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. Practically, FIFO-Diffusion consumes a constant amount of memory regardless of the target video length given a baseline model, while well-suited for parallel inference on multiple GPUs. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. Generated video samples and source codes are available at our project page.


Overview

This paper proposes FIFO-Diffusion, an inference technique that generates arbitrarily long videos from text using a pretrained video diffusion model, with no additional training.
The method keeps a queue of consecutive frames at increasing noise levels and denoises them together (diagonal denoising); at each step it dequeues the fully denoised frame at the head and enqueues a new random noise frame at the tail.
Because one frame leaves and one enters at every step, videos can be extended indefinitely at constant memory cost, in contrast to typical video generation models that produce fixed-length outputs.

Plain English Explanation

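FIFO-Diffusion is a technique for generating endless videos from text descriptions. It works with a type of machine learning model called a diffusion model, which turns random noise into images or video by removing the noise step by step. Instead of denoising a whole clip at once, FIFO-Diffusion keeps a queue of frames in which each frame is a little noisier than the one before it. At every step the model cleans up all the frames in the queue a bit, the finished frame at the front is taken out as the next frame of the video, and a fresh frame of pure noise is added at the back.

The key insight behind FIFO-Diffusion is that an existing, pretrained text-to-video model can be reused as-is, with no additional training. Frames come off the queue one after another, so the video can keep going for as long as you like while memory use stays constant. This is different from most video generation methods, which can only produce fixed-length outputs.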

Related papers such as BIVDiff (a training-free framework for general-purpose video synthesis) and work on frame interpolation via consecutive Brownian bridge diffusion explored related ideas, but FIFO-Diffusion takes the concept further by allowing the video to continue indefinitely based on the text prompt.

Technical Explanation

The core of FIFO-Diffusion is diagonal denoising with a pretrained text-conditional video diffusion model. Rather than denoising a fixed set of frames that all share one noise level, the model concurrently processes a queue of consecutive frames whose noise levels increase from head to tail. The model itself is not retrained or fine-tuned; only the inference procedure changes.

The process works as follows:

The user provides a text prompt describing the desired video content.
A queue holds a series of consecutive frames with increasing noise levels, from nearly clean at the head to random noise at the tail.
At each iteration, the pretrained model denoises every frame in the queue by one step, conditioned on the text prompt.
The fully denoised frame at the head is dequeued as the next output frame, and a new random noise frame is enqueued at the tail.
This process can continue indefinitely, so the video has no fixed length.
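The abstract describes diagonal denoising as a queue from which a fully denoised frame is dequeued at the head while a new random noise frame is enqueued at the tail. The following is a minimal toy sketch of that loop only, not the authors' implementation. `toy_denoise_step`, `fifo_generate`, and `NUM_LEVELS` are hypothetical names, each frame is a single float rather than a latent tensor, and the way the queue is initialized is a placeholder, since the abstract does not describe it.

```python
from collections import deque
import random

NUM_LEVELS = 4  # toy queue length; also the number of denoising steps per frame


def toy_denoise_step(latent, level):
    """Hypothetical stand-in for one denoising step of a pretrained model.

    Takes a frame from noise `level` to `level - 1` by shrinking it toward 0,
    which plays the role of the "clean" signal in this toy.
    """
    return latent * (level - 1) / level


def fifo_generate(num_frames, seed=0):
    rng = random.Random(seed)
    # Placeholder initialization (an assumption, not from the source):
    # noise level 1 at the head of the queue, NUM_LEVELS at the tail.
    queue = deque(
        {"latent": rng.gauss(0.0, 1.0), "level": level}
        for level in range(1, NUM_LEVELS + 1)
    )
    video = []
    while len(video) < num_frames:
        # Diagonal denoising: one step for every queued frame, each at its
        # own noise level. A real video model would process the frames
        # jointly in one call; this toy treats them independently.
        for frame in queue:
            frame["latent"] = toy_denoise_step(frame["latent"], frame["level"])
            frame["level"] -= 1
        # The head is now fully denoised: dequeue it as the next output frame.
        head = queue.popleft()
        assert head["level"] == 0
        video.append(head["latent"])
        # Enqueue a fresh random noise frame at the tail.
        queue.append({"latent": rng.gauss(0.0, 1.0), "level": NUM_LEVELS})
        # The queue never grows, whatever the target video length.
        assert len(queue) == NUM_LEVELS
    return video


video = fifo_generate(num_frames=10)
print(len(video))  # prints 10
```

The queue length stays at `NUM_LEVELS` however many frames are requested, which is the toy counterpart of the constant-memory property claimed in the abstract.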

The key innovations in this work include:

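Diagonal denoising over a first-in-first-out queue, which makes video length conceptually unbounded while memory use stays constant for a given baseline model.
Latent partitioning, which reduces the gap between training and inference that arises because the frames processed together sit at different noise levels.
Lookahead denoising, which exploits forward referencing so that noisier frames near the tail can benefit from the cleaner frames ahead of them.
The approach sits alongside related work on text conditioning and frame synthesis, such as TI2V: Zero-Shot Text-Guided Image-to-Video Generation, LLM-Grounded Video Diffusion Models, and Motion-Aware Latent Diffusion Models for Video Frame Synthesis.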
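The abstract says that processing frames at different noise levels together creates a training-inference gap, and that latent partitioning reduces it, but it does not spell out the mechanism. One plausible reading, assumed here purely for illustration, is that a longer queue is split into blocks so that each model call sees a narrower range of noise levels. The sketch below only does the arithmetic for that assumption; `noise_levels` and `max_span_per_call` are hypothetical helpers, and the evenly spaced, normalized noise levels are also an assumption.

```python
def noise_levels(queue_len):
    """Evenly spaced, normalized noise levels from head (low) to tail (1.0)."""
    return [(i + 1) / queue_len for i in range(queue_len)]


def max_span_per_call(frames_per_call, num_blocks):
    """Largest noise-level range that any single model call has to cover.

    Assumes the queue holds frames_per_call * num_blocks frames and that
    each block of frames_per_call consecutive frames is denoised in its
    own model call.
    """
    queue_len = frames_per_call * num_blocks
    levels = noise_levels(queue_len)
    blocks = [
        levels[i:i + frames_per_call]
        for i in range(0, queue_len, frames_per_call)
    ]
    return max(block[-1] - block[0] for block in blocks)


# One block: a single call covers almost the whole noise range.
print(max_span_per_call(frames_per_call=16, num_blocks=1))
# Four blocks: each call covers roughly a quarter of that range.
print(max_span_per_call(frames_per_call=16, num_blocks=4))
```

Under these assumptions, more blocks mean that the frames seen together in one call are closer in noise level, which is nearer to the training setup in which all frames share a single noise level.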

Critical Analysis

One limitation, noted in the abstract itself, is that diagonal denoising is a double-edged sword: frames near the tail can take advantage of cleaner frames ahead of them, but processing frames at different noise levels together creates a discrepancy between training and inference. Latent partitioning is introduced to reduce this gap, and it is not clear from the abstract how much of it remains. A very long video may also drift in content over time, since the abstract describes no mechanism for enforcing consistency beyond the frames currently in the queue.

Additionally, the abstract reports promising results on existing text-to-video baselines but gives no quantitative detail. Comparisons to other state-of-the-art long-video generation methods would help clarify the strengths and weaknesses of the FIFO-Diffusion approach.

Further research could explore feedback loops or other mechanisms to keep very long videos consistent, or ways to provide more fine-grained control over how the video evolves over time.

Conclusion

FIFO-Diffusion is a novel inference technique that enables the generation of arbitrarily long, text-guided videos from a pretrained diffusion model without additional training. By denoising a queue of frames diagonally, it produces continuous video sequences that are not limited to a fixed length, at constant memory cost.

While the paper presents a promising new direction for video generation, further research is needed to address potential limitations around long-term consistency and to more thoroughly evaluate the visual quality and capabilities of the approach. Nonetheless, FIFO-Diffusion represents an exciting step towards more flexible and open-ended video generation, with potential applications in areas like interactive entertainment and creative expression.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He

Video generation has made remarkable progress in recent years, especially since the advent of the video diffusion models. Many video generation models can produce plausible synthetic videos, e.g., Stable Video Diffusion (SVD). However, most video models can only generate low frame rate videos due to the limited GPU memory as well as the difficulty of modeling a large set of frames. The training videos are always uniformly sampled at a specified interval for temporal compression. Previous methods promote the frame rate by either training a video interpolation model in pixel space as a postprocessing stage or training an interpolation model in latent space for a specific base video model. In this paper, we propose a training-free video interpolation method for generative video diffusion models, which is generalizable to different models in a plug-and-play manner. We investigate the non-linearity in the feature space of video diffusion models and transform a video model into a self-cascaded video diffusion model by incorporating the designed hidden state correction modules. The self-cascaded architecture and the correction module are proposed to retain the temporal consistency between key frames and the interpolated frames. Extensive evaluations are performed on multiple popular video models to demonstrate the effectiveness of the proposed method, especially that our training-free method is even comparable to trained interpolation models supported by huge compute resources and large-scale datasets.

6/4/2024

cs.CV

Online Continual Learning of Video Diffusion Models From a Single Video Stream

Jason Yoo, Dylan Green, Geoff Pleiss, Frank Wood

Diffusion models have shown exceptional capabilities in generating realistic videos. Yet, their training has been predominantly confined to offline environments where models can repeatedly train on i.i.d. data to convergence. This work explores the feasibility of training diffusion models from a semantically continuous video stream, where correlated video frames sequentially arrive one at a time. To investigate this, we introduce two novel continual video generative modeling benchmarks, Lifelong Bouncing Balls and Windows 95 Maze Screensaver, each containing over a million video frames generated from navigating stationary environments. Surprisingly, our experiments show that diffusion models can be effectively trained online using experience replay, achieving performance comparable to models trained with i.i.d. samples given the same number of gradient steps.

6/10/2024

cs.CV cs.LG


BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang

Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, it requires huge memory and computation overhead to train a video generation foundation model. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models into videos in a training-free manner, temporal consistency cannot be well preserved. Finally, these adaptation methods are specifically designed for one task and fail to generalize to different tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined as BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models (e.g., VidRD and ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video synthesis tasks, including controllable video generation, video editing, video inpainting, and outpainting.

4/10/2024

cs.CV cs.AI

Controllable Longer Image Animation with Diffusion Models

Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu

Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, failing to exhibit highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos and learning moving trajectories and strengths. Current pretrained video generation models are typically limited to producing very short videos, typically less than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks, facilitating the creation of videos over 100 frames in length while maintaining consistency in content scenery and motion coordination. Specifically, we decompose the denoise process into two distinct phases: the shaping of scene contours and the refining of motion details. Then we reschedule the noise to control the generated frame sequences maintaining long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page: https://wangqiang9.github.io/Controllable.github.io/

5/29/2024

cs.CV