MEVG: Multi-event Video Generation with Text-to-Video Models

Read original: arXiv:2312.04086 - Published 7/17/2024 by Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Sangpil Kim


Overview

Introduces a new diffusion-based video generation method
Generates videos showing multiple events from user-provided individual sentences
Requires no large-scale video dataset, because it uses a pre-trained diffusion-based text-to-video model without fine-tuning

Plain English Explanation

This research presents a novel approach to generating videos that depict multiple events, starting from individual sentences provided by the user. The key innovation is that this method does not require a large dataset of videos to train on. Instead, it uses a pre-trained text-to-video diffusion model without any additional fine-tuning.

The core idea is to use a "last frame-aware diffusion process" to maintain visual coherence between consecutive video segments, each of which corresponds to a different event. This involves initializing the latent representation and adjusting the noise level to enhance the motion dynamics in the generated video. Additionally, the researchers found that iteratively updating the latent vectors while referring to all the preceding frames helps maintain a consistent global appearance across the video clip.
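As a rough illustration of the last frame-aware initialization described above, the sketch below seeds every frame latent of the next clip with the last frame latent of the previous clip, then mixes in Gaussian noise using the standard forward-diffusion formula. The function name `init_next_clip_latents`, the `alpha_bar` parameter, and the flat list-of-floats latent layout are assumptions made for illustration; this summary does not specify the paper's exact procedure.

```python
import math
import random


def init_next_clip_latents(last_frame_latent, num_frames, alpha_bar, seed=0):
    """Seed each frame of the next clip from the previous clip's last frame.

    Hypothetical helper, not the paper's implementation. It applies
        z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps
    (the standard forward-diffusion noising step) independently per frame.
    alpha_bar near 1 keeps the clip close to the last frame; smaller values
    inject more noise and leave more room for new motion.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    rng = random.Random(seed)
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        [signal * z + noise * rng.gauss(0.0, 1.0) for z in last_frame_latent]
        for _ in range(num_frames)
    ]


last_frame = [0.5, -1.0, 2.0]  # toy 3-dimensional latent
latents = init_next_clip_latents(last_frame, num_frames=4, alpha_bar=0.7)
```

In this toy setup, `alpha_bar` plays the role of the adjustable noise level: lowering it trades fidelity to the previous clip's last frame for more variation between frames.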

To handle dynamic text input for video generation, the authors developed a "prompt generator" that takes the user's text and converts it into multiple optimal prompts for the text-to-video diffusion model. This allows for greater flexibility in the types of videos that can be generated from natural language descriptions.
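To make the idea of turning one free-form description into several per-event prompts concrete, here is a deliberately naive, rule-based stand-in. It is not the paper's prompt generator, whose design is not described in this summary; the function name `split_into_event_prompts` and the splitting rule (sentence ends and the connectives "then" / "and then") are assumptions chosen only for illustration.

```python
import re


def split_into_event_prompts(story):
    """Split a multi-event description into one prompt per event.

    Toy rule (an assumption, not the paper's method): break on sentence-ending
    punctuation and on the connectives "then" / "and then".
    """
    parts = re.split(
        r"(?:[.;!?]+|,?\s+(?:and\s+)?then\b)", story, flags=re.IGNORECASE
    )
    # Drop empty fragments and trim stray spaces or commas.
    return [p.strip(" ,") for p in parts if p.strip(" ,")]


prompts = split_into_event_prompts(
    "A dog runs across the beach, then it jumps into the water. "
    "The dog swims back to shore."
)
# prompts == ["A dog runs across the beach",
#             "it jumps into the water",
#             "The dog swims back to shore"]
```

Note that this toy rule leaves the pronoun "it" unresolved in the second prompt. Each prompt is fed to the text-to-video model separately, so a practical prompt generator would need to make every prompt self-contained.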

Technical Explanation

The paper introduces a diffusion-based video generation method that can create videos depicting multiple events from individual user-provided sentences. Because it builds on a pre-trained text-to-video model without fine-tuning, it does not require a large dataset of videos for training.

The key technical innovations are:

Last Frame-Aware Diffusion Process: To preserve visual coherence between consecutive video segments, the method initializes the latent representation and adjusts the noise level to enhance the motion dynamics in the generated video.
Iterative Latent Vector Update: By referring to all the preceding frames when updating the latent vectors, the method maintains a consistent global appearance across the video clip.
Prompt Generator: A novel component that translates the user's text input into multiple optimal prompts for the text-to-video diffusion model, enabling greater flexibility in the types of videos that can be generated.
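The iterative latent update in the list above can be pictured with a toy example. The sketch below repeatedly nudges one frame latent toward the mean of all preceding frame latents, using plain gradient descent on a squared-distance objective. The objective, the function name `pull_toward_preceding_frames`, and the `step_size` and `num_steps` parameters are all assumptions for illustration; the paper's actual update rule is not given in this summary.

```python
def pull_toward_preceding_frames(latent, preceding_latents, step_size=0.1, num_steps=10):
    """Iteratively nudge a frame latent toward the mean of all preceding frames.

    Hypothetical update: gradient descent on 0.5 * ||latent - mean||^2,
    whose gradient with respect to latent is (latent - mean).
    """
    if not preceding_latents:
        return list(latent)  # nothing to refer to yet
    count = len(preceding_latents)
    mean = [
        sum(frame[i] for frame in preceding_latents) / count
        for i in range(len(latent))
    ]
    current = list(latent)
    for _ in range(num_steps):
        current = [c - step_size * (c - m) for c, m in zip(current, mean)]
    return current


preceding = [[0.0, 0.0], [2.0, 2.0]]  # toy latents of earlier frames, mean [1.0, 1.0]
updated = pull_toward_preceding_frames([3.0, -1.0], preceding)
```

Each step shrinks the gap to the mean by a factor of `1 - step_size`, so the latent moves toward the shared appearance of earlier frames without collapsing onto it after a small number of steps.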

The researchers conducted extensive experiments and user studies, demonstrating that their proposed method outperforms other video generation models in terms of temporal coherency of content and semantics.

Critical Analysis

The paper presents a promising approach to video generation that addresses some limitations of existing methods. By leveraging a pre-trained diffusion-based text-to-video model without fine-tuning, the researchers have developed a more efficient and flexible system that can generate videos depicting multiple events from natural language descriptions.

However, the paper does not provide a detailed analysis of the computational and memory requirements of the proposed method, which could be an important consideration for real-world deployment. Additionally, the researchers mention that the method may struggle with accurately depicting certain types of dynamic events, such as those involving rapid motion or complex interactions. Further research and experimentation may be needed to address these limitations.

It would also be valuable to see a more comprehensive evaluation of the generated videos, including comparisons to ground truth data or human-created videos, to better assess the method's ability to capture realistic and meaningful content.

Conclusion

This research introduces a novel diffusion-based video generation approach that can create videos depicting multiple events from individual user-provided sentences. By leveraging a pre-trained text-to-video diffusion model and introducing several technical innovations, the method does not require a large-scale video dataset for training, which makes it more accessible than approaches that depend on large-scale video training.

The proposed approach demonstrates promising results in terms of temporal coherency and semantic consistency, opening up new avenues for more intuitive and engaging video creation tools. As the field of generative AI continues to advance, this work represents an important step towards more natural and versatile video generation capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


MEVG: Multi-event Video Generation with Text-to-Video Models

Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Sangpil Kim

We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user. Our method does not require a large-scale video dataset since our method uses a pre-trained diffusion-based text-to-video generative model without a fine-tuning process. Specifically, we propose a last frame-aware diffusion process to preserve visual coherence between consecutive videos where each video consists of different events by initializing the latent and simultaneously adjusting noise in the latent to enhance the motion dynamic in a generated video. Furthermore, we find that the iterative update of latent vectors by referring to all the preceding frames maintains the global appearance across the frames in a video clip. To handle dynamic text input for video generation, we utilize a novel prompt generator that transfers coarse text messages from the user into the multiple optimal prompts for the text-to-video diffusion model. Extensive experiments and user studies show that our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics. Video examples are available on our project page: https://kuai-lab.github.io/eccv2024mevg.

7/17/2024

Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Bing Li, Cheng Zheng, Wenxuan Zhu, Jinjie Mai, Biao Zhang, Peter Wonka, Bernard Ghanem

While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the reused layers' incompatibility that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.

6/14/2024

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.

9/4/2024


LLM-grounded Video Diffusion Models

Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li

Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.

5/7/2024