DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

Read original: arXiv:2408.07481 - Published 8/15/2024 by Xiaojing Zhong, Xinyi Huang, Xiaofeng Yang, Guosheng Lin, Qingyao Wu

Overview

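This paper presents DeCo, a diffusion-based framework for text-guided editing of human-centered videos.
DeCo treats the human and the background as separate editable targets: the human is edited through a decoupled dynamic representation built on a parametric human body prior, and the background is edited as a layered atlas.
According to the authors, keeping each component coherent preserves the original motion and gives better spatial-temporal consistency than prior video editing methods, especially in longer videos.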

Plain English Explanation

The paper describes a new way to edit videos of people using text prompts, so that the person or the scene can be changed while the person keeps moving as in the original footage. Existing diffusion-based video editors struggle with complex subjects like humans.

The researchers' approach splits the video into two parts and edits each one separately:

Editing the human: A parametric body model (a compact description of body pose and shape) anchors the person's motion. A new, tailored human is generated on top of it, so the edited person follows the same movements as in the original video.
Editing the background: The background is represented as a layered atlas, a shared 2D image that the video frames map onto, so text-guided image editing tools can be applied to it and the change carries across the video.

After editing, the parts are recombined, and a lighting-aware video harmonizer adjusts the result so that the human and the background appear consistently lit. By keeping each part coherent on its own, the researchers aim for edits that stay consistent across space and time, particularly in longer videos.
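
The abstract describes treating the background as a layered atlas. The toy sketch below illustrates why an atlas helps with temporal consistency: every frame looks up its pixels in one shared image, so a single edit to that image shows up in all frames. Everything here is an illustrative assumption (the function name `render_from_atlas`, the integer lookup table `uv`, the one-pixel camera pan); real atlas methods learn a continuous mapping, and this is not the paper's implementation.

```python
import numpy as np

def render_from_atlas(atlas, uv):
    """Look up every frame pixel in a shared 2D atlas (hypothetical helper).

    atlas: (A_h, A_w, 3) float array, the shared background image
    uv:    (T, H, W, 2) integer array of (row, col) atlas coordinates
    """
    return atlas[uv[..., 0], uv[..., 1]]

# Toy setup (assumption): the camera pans right by one pixel per frame.
T, H, W = 3, 2, 4
atlas = np.zeros((H, W + T - 1, 3))
rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
uv = np.stack([np.stack([rows, cols + t], axis=-1) for t in range(T)])

# A single edit on the atlas: paint atlas column 3 red.
atlas[:, 3] = [1.0, 0.0, 0.0]

# The edit appears in every frame, shifted according to the camera pan.
frames = render_from_atlas(atlas, uv)
```

In this toy, the red column lands at column 3 of frame 0, column 2 of frame 1, and column 1 of frame 2, without editing any frame individually.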

Technical Explanation

The key idea in this paper is to decouple a human-centered video into a human and a background, edit each as a separate target, and then recombine them.

For the human, the researchers propose a decoupled dynamic human representation that uses a parametric human body prior. This lets the method generate a tailored human from a text prompt while preserving the motion of the original video. To improve the geometry and texture of the human during optimization, they extend the calculation of score distillation sampling (SDS) into normal space and image space.

For the background, the video is treated as a layered atlas, so text-guided image editing approaches can be applied to it directly.
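
The abstract states that DeCo extends the calculation of score distillation sampling (SDS) into normal space and image space. For orientation, the standard SDS gradient can be written as below, where \theta are the parameters being optimized, x is an image rendered from them, y is the text prompt, \epsilon is Gaussian noise, \hat{\epsilon}_\phi is the noise predicted by a pretrained diffusion model, and w(t) is a timestep weighting. The second line is only an assumed reading of "normal space and image space" as a weighted sum over a rendered normal map and a rendered color image; the weights \lambda_n and \lambda_c are hypothetical, and the paper's exact formulation may differ.

```latex
\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(x)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x_t = \alpha_t\, x + \sigma_t\, \epsilon
\]
\[
% Assumed combination, not taken from the paper:
\mathcal{L}
  = \lambda_n\, \mathcal{L}_{\mathrm{SDS}}(x_{\mathrm{normal}})
  + \lambda_c\, \mathcal{L}_{\mathrm{SDS}}(x_{\mathrm{image}})
\]
```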

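Because the human and the background are edited independently, their lighting can end up inconsistent. DeCo addresses this with a lighting-aware video harmonizer, a problem the authors say was previously overlooked in decompose-edit-combine approaches. They report that DeCo outperforms prior video editing methods on human-centered videos in both qualitative and numerical experiments, especially on longer videos.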
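
The abstract places DeCo among decompose-edit-combine approaches. The "combine" step can be sketched as per-frame alpha compositing of an edited human layer over an edited background, as below. This is a minimal sketch under stated assumptions: the function name `composite_frames`, the array shapes, and the use of a plain alpha mask are illustrative and not taken from the paper, and the lighting-aware harmonizer that DeCo applies to the result is not modeled here.

```python
import numpy as np

def composite_frames(human_rgb, human_alpha, background_rgb):
    """Alpha-composite a human layer over a background (hypothetical helper).

    human_rgb:      (T, H, W, 3) float array in [0, 1]
    human_alpha:    (T, H, W) float array in [0, 1], 1 where the human is
    background_rgb: (T, H, W, 3) float array in [0, 1]
    """
    alpha = human_alpha[..., None]  # add a channel axis for broadcasting
    return alpha * human_rgb + (1.0 - alpha) * background_rgb

# Toy data (assumption): a white "human" square on a black background.
T, H, W = 2, 4, 4
human = np.ones((T, H, W, 3))
background = np.zeros((T, H, W, 3))
mask = np.zeros((T, H, W))
mask[:, 1:3, 1:3] = 1.0

video = composite_frames(human, mask, background)
```

A plain composite like this leaves each layer's lighting untouched, which is the kind of mismatch a harmonization step is meant to correct.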

Critical Analysis

The paper presents a clearly motivated approach to editing human-centered videos, and the abstract reports both qualitative and numerical comparisons with prior methods. The abstract itself does not list limitations, so the points below are questions to check against the full paper rather than findings from it:

The human representation depends on a parametric human body prior, so it is unclear how well the method handles subjects that such a prior may fit poorly.
The method involves an optimization step guided by score distillation sampling, which may be computationally expensive for each edited video.
The abstract does not address the ethical use of this technology, such as the creation of fake or manipulated videos.

It would also be worth checking how much control users have beyond text prompts, such as directly specifying the desired appearance or using more intuitive editing tools.

Conclusion

This paper introduces DeCo, a decoupled framework for editing human-centered videos with diffusion models. By editing the human and the background as separate targets and then harmonizing their lighting, the researchers report better results than previous video editing techniques, particularly on longer videos.

This work could improve the consistency and customization of edited videos, with possible applications in areas like virtual avatars and movie production. Open questions remain around computational cost and ethical use, and these would need to be addressed to fully realize the potential of this technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

Xiaojing Zhong, Xinyi Huang, Xiaofeng Yang, Guosheng Lin, Qingyao Wu

Diffusion models usher a new era of video editing, flexibly manipulating the video contents with text prompts. Despite the widespread application demand in editing human-centered videos, these models face significant challenges in handling complex objects like humans. In this paper, we introduce DeCo, a novel video editing framework specifically designed to treat humans and the background as separate editable targets, ensuring global spatial-temporal consistency by maintaining the coherence of each individual component. Specifically, we propose a decoupled dynamic human representation that utilizes a parametric human body prior to generate tailored humans while preserving the consistent motions as the original video. In addition, we consider the background as a layered atlas to apply text-guided image editing approaches on it. To further enhance the geometry and texture of humans during the optimization, we extend the calculation of score distillation sampling into normal space and image space. Moreover, we tackle inconsistent lighting between the edited targets by leveraging a lighting-aware video harmonizer, a problem previously overlooked in decompose-edit-combine approaches. Extensive qualitative and numerical experiments demonstrate that DeCo outperforms prior video editing methods in human-centered videos, especially in longer videos.

8/15/2024

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

Yi Zuo, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Shuyuan Yang, Yuwei Guo

Existing diffusion-based video editing methods have achieved impressive results in motion editing. Most of the existing methods focus on the motion alignment between the edited video and the reference video. However, these methods do not constrain the background and object content of the video to remain unchanged, which makes it possible for users to generate unexpected videos. In this paper, we propose a one-shot video motion editing method called Edit-Your-Motion that requires only a single text-video pair for training. Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to decouple spatio-temporal features in space-time diffusion models. DPL separates learning object content and motion into two training stages. In the first training stage, we focus on learning the spatial features (the features of object content) and breaking down the temporal relationships in the video frames by shuffling them. We further propose Recurrent-Causal Attention (RC-Attn) to learn the consistent content features of the object from unordered video frames. In the second training stage, we restore the temporal relationship in video frames to learn the temporal feature (the features of the background and object's motion). We also adopt the Noise Constraint Loss to smooth out inter-frame differences. Finally, in the inference stage, we inject the content features of the source object into the editing branch through a two-branch structure (editing branch and reconstruction branch). With Edit-Your-Motion, users can edit the motion of objects in the source video to generate more exciting and diverse videos. Comprehensive qualitative experiments, quantitative experiments and user preference studies demonstrate that Edit-Your-Motion performs better than other methods.

5/8/2024

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.

5/29/2024

Decomposition Betters Tracking Everything Everywhere

Rui Li, Dong Liu

Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging as a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, either of which uses a quasi-3D canonical volume to represent. DecoMotion separately coordinates the transformations between local and canonical spaces, facilitating an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations and meanwhile obtains decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate our method boosts the point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions.

7/17/2024