MotionBooth: Motion-Aware Customized Text-to-Video Generation

Read original: arXiv:2406.17758 - Published 8/22/2024 by Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen

Overview

MotionBooth is a framework for animating a customized subject in generated video, with control over both subject motion and camera motion.
It fine-tunes a text-to-video model on a few images of the subject so that the model captures the subject's shape and attributes.
At inference time, training-free techniques steer how the subject and the camera move, while the subject's appearance is preserved.

Plain English Explanation

MotionBooth is an AI system that creates videos of a specific subject, such as a particular object shown in a handful of photos, from text prompts. Unlike text-to-video methods that only follow the prompt, MotionBooth also lets the user direct how the subject moves and how the camera moves, while keeping the subject's appearance consistent.

For example, a user could supply a few photos of their own dog, ask for a video of that dog running on a beach, and specify that the dog should cross the frame while the camera pans. The dog in the video should look like the dog in the photos rather than a generic one. Combining subject customization with motion control in this way is the main advantage the paper claims over existing approaches.

Technical Explanation

According to the paper's abstract, MotionBooth works in two stages. During training, a text-to-video model is fine-tuned on a few images of the subject. Three losses guide this step: a subject region loss and a video preservation loss improve how well the subject is learned, and a subject token cross-attention loss ties the customized subject to the motion control signals. During inference, two training-free techniques control motion: cross-attention map manipulation governs subject motion, and a latent shift module controls camera movement.
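The abstract quoted under Related Papers names a subject region loss but does not spell out its form. As a rough illustration only, the sketch below assumes it is a noise-prediction error restricted to the pixels covered by a subject mask; the function name, argument shapes, and the masking scheme are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def subject_region_loss(pred_noise, true_noise, subject_mask, eps=1e-8):
    """Mean squared error over subject pixels only (hypothetical formulation).

    pred_noise, true_noise: arrays of shape (C, H, W)
    subject_mask: array of shape (H, W), 1 inside the subject, 0 elsewhere
    """
    # Broadcast the (H, W) mask across the channel axis.
    mask = subject_mask[None, :, :].astype(pred_noise.dtype)
    sq_err = (pred_noise - true_noise) ** 2
    # Normalize by the number of masked elements (pixels times channels).
    n_elements = mask.sum() * pred_noise.shape[0]
    return float((sq_err * mask).sum() / (n_elements + eps))
```

Under this assumed form, errors in the background do not contribute to the loss, so fine-tuning on a few images would focus on the subject rather than on whatever happens to surround it in the photos.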

The core idea is to separate the content (what is shown) from the motion (how it is shown). The subject's appearance is learned from images during fine-tuning, while subject and camera motion are imposed at inference without further training. The authors report extensive quantitative and qualitative evaluations showing that the method preserves subject appearance while controlling motion.
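The abstract quoted under Related Papers mentions two training-free controls: cross-attention map manipulation for subject motion and a latent shift module for camera movement. The toy sketch below shows what operations of that kind could look like on plain arrays. Every name, shape, and parameter here is hypothetical, and the choice to fill vacated latent regions with fresh noise is an assumption; the paper's actual modules may differ.

```python
import numpy as np

def steer_subject_attention(attn, box, boost=2.0, damp=0.5):
    """Amplify a subject token's attention inside a target box, damp it elsewhere.

    attn: (H, W) attention map for the subject token
    box: (y0, y1, x0, x1) region where the subject should appear in this frame
    """
    y0, y1, x0, x1 = box
    edited = attn * damp                       # new array; the input is untouched
    edited[y0:y1, x0:x1] = attn[y0:y1, x0:x1] * boost
    return edited

def shift_latents(latents, dx_per_frame, dy_per_frame, rng):
    """Shift frame t's content by (t * dx, t * dy) latent pixels.

    latents: (T, C, H, W) video latents
    Regions with no source content are filled with fresh Gaussian noise
    (an assumption made for this sketch).
    """
    T, C, H, W = latents.shape
    out = np.empty_like(latents)
    for t in range(T):
        dx, dy = t * dx_per_frame, t * dy_per_frame
        shifted = np.roll(latents[t], shift=(dy, dx), axis=(1, 2))
        # Mark the area that holds genuinely shifted content, not wrap-around.
        valid = np.zeros((H, W), dtype=bool)
        y0, y1 = min(H, max(dy, 0)), max(0, min(H, H + dy))
        x0, x1 = min(W, max(dx, 0)), max(0, min(W, W + dx))
        valid[y0:y1, x0:x1] = True
        fill = rng.standard_normal((C, H, W)).astype(latents.dtype)
        out[t] = np.where(valid[None], shifted, fill)
    return out
```

Moving the box from frame to frame would trace a subject path, and a growing per-frame shift would imitate a steady pan. Both functions only edit intermediate arrays, which is what makes controls of this kind training-free.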

Critical Analysis

The MotionBooth paper presents a compelling approach to text-to-video generation that addresses some key limitations of prior work. By incorporating motion-aware conditioning, the system is able to generate videos that are more personalized and controllable from the user's perspective.

However, the paper does not fully explore the potential limitations or edge cases of this approach. For instance, it is unclear how MotionBooth would handle highly abstract or complex text prompts, or requested motions that call for views of the subject that the few training images do not show. Additionally, the paper does not discuss the computational and resource requirements of the system, which could be an important practical consideration for real-world deployment.

Further research could explore ways to make MotionBooth more robust and generalizable, such as by incorporating additional modalities (e.g., audio) or developing more sophisticated motion modeling techniques. Comparing MotionBooth to human-created videos in terms of realism and quality could also provide valuable insights.

Conclusion

MotionBooth represents an important step forward in text-to-video generation by introducing motion-aware conditioning as a means to create more personalized and controllable video outputs. While the current implementation has some limitations, the core concept and technical approach demonstrate the potential for AI systems to generate customized video content that closely aligns with users' specific preferences and requirements. As the field of generative AI continues to advance, innovations like MotionBooth will likely play a crucial role in making video creation more accessible and tailored to individual needs.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MotionBooth: Motion-Aware Customized Text-to-Video Generation

Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen

In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach presents subject region loss and video preservation loss to enhance the subject's learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control as well. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth

8/22/2024

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, Jing Liao

Recent text-to-video diffusion models have achieved impressive progress. In practice, users often desire the ability to control object motion and camera movement independently for customized video creation. However, current methods lack the focus on separately controlling object motion and camera movement in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as camera's pan and zoom movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model's inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page and code are available at https://direct-a-video.github.io/.

5/7/2024

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin

Motion-based controllable text-to-video generation involves motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

7/2/2024

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot video motion customization, we propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties. It leverages low-rank adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. To disentangle the spatial and temporal information during training, we introduce a novel concept of appearance absorbers that detach the original appearance from the reference video prior to motion learning. The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks such as custom video generation and editing, video appearance customization and multiple motion combination. Our project page can be found at https://customize-a-video.github.io.

8/29/2024