MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

2406.19680

Published 7/1/2024 by Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

Abstract

In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .

Overview

This paper introduces MimicMotion, a novel approach for generating high-quality human motion videos with confidence-aware pose guidance.
The key idea is to use a confidence map derived from the input pose to guide the video generation process, resulting in more realistic and coherent motion.
MimicMotion outperforms state-of-the-art methods in terms of both visual quality and motion realism, as demonstrated through extensive experiments.

Plain English Explanation

MimicMotion is a system that can generate realistic-looking videos of people moving and doing different actions. The main innovation is that it uses a "confidence map" to guide the video generation process.

The confidence map measures how certain the system is about each part of the input pose (the position and orientation of the person's body). Areas of the pose that the system is more confident about have more influence on the video generation, leading to more realistic and coherent motion.

This approach outperforms other state-of-the-art methods, meaning it can create videos that look and move more naturally than what other systems can produce. This is useful for applications like animation, visual effects, and video games, where high-quality human motion is important.
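The confidence-map idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `render_confidence_pose_map`, the square marker shape, and the `radius` parameter are all hypothetical, and it assumes keypoints arrive as integer pixel coordinates with a confidence score in [0, 1] from some pose estimator. Each keypoint is drawn with an intensity equal to its confidence, so uncertain keypoints contribute a weaker guidance signal.

```python
import numpy as np

def render_confidence_pose_map(keypoints, confidences, height, width, radius=2):
    """Draw each keypoint as a small square whose intensity equals its confidence.

    Hypothetical helper for illustration only; `radius` is an assumed parameter.
    """
    pose_map = np.zeros((height, width), dtype=np.float32)
    for (x, y), c in zip(keypoints, confidences):
        # Clip the square marker to the image bounds.
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        patch = pose_map[y0:y1, x0:x1]  # a view into pose_map
        # Keep the larger value where markers overlap.
        np.maximum(patch, c, out=patch)
    return pose_map

# Example: one confidently detected keypoint and one uncertain keypoint.
keypoints = [(8, 8), (20, 12)]   # (x, y) pixel coordinates
confidences = [0.95, 0.30]
pose_map = render_confidence_pose_map(keypoints, confidences, height=32, width=32)
```

In this sketch the confident keypoint appears almost at full brightness while the uncertain one is dim, which is one simple way a downstream network could be told how much to trust each part of the pose.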

Technical Explanation

The key technical elements of MimicMotion include:

Confidence-aware Pose Guidance: MimicMotion uses a confidence map derived from the input pose to guide the video generation process. This helps the system focus on the most reliable parts of the pose and generate more coherent and realistic motion.
Motion Generation Network: The core of MimicMotion is a deep neural network that takes the input pose and confidence map as input and generates the corresponding high-quality video output.
Regional Loss Amplification: According to the abstract, training amplifies the loss in image regions selected on the basis of pose confidence, which the authors report significantly reduces image distortion.
Evaluation: The authors conduct extensive experiments to compare MimicMotion against state-of-the-art methods, demonstrating significant improvements in both visual quality and motion realism.
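The abstract describes regional loss amplification based on pose confidence. A minimal sketch of what such a loss could look like follows, under stated assumptions: the names `regional_weight_mask` and `amplified_mse`, the confidence `threshold`, the region `radius`, and the `gain` factor are all hypothetical and not taken from the paper. Pixels near high-confidence keypoints get a larger weight in a mean-squared-error loss, so errors there cost more during training.

```python
import numpy as np

def regional_weight_mask(keypoints, confidences, height, width,
                         threshold=0.8, radius=3, gain=10.0):
    """Weight mask: `gain` around confident keypoints, 1.0 elsewhere.

    All parameter values are assumptions chosen for illustration.
    """
    mask = np.ones((height, width), dtype=np.float32)
    for (x, y), c in zip(keypoints, confidences):
        if c >= threshold:
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            mask[y0:y1, x0:x1] = gain
    return mask

def amplified_mse(pred, target, mask):
    """Mean squared error with per-pixel amplification given by `mask`."""
    return float(np.mean(mask * (pred - target) ** 2))

# Example: the same unit error placed in an amplified and a normal region.
mask = regional_weight_mask([(4, 4), (12, 12)], [0.9, 0.2], height=16, width=16)
pred = np.zeros((16, 16))
target_hi = np.zeros((16, 16)); target_hi[4, 4] = 1.0    # error near confident keypoint
target_lo = np.zeros((16, 16)); target_lo[12, 12] = 1.0  # error near uncertain keypoint
loss_hi = amplified_mse(pred, target_hi, mask)
loss_lo = amplified_mse(pred, target_lo, mask)
```

With these assumed settings, the same pixel error is penalized `gain` times more when it falls inside a confidently detected region, which gives the model a stronger incentive to render those regions accurately.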

Critical Analysis

The paper presents a well-designed and thorough study, with a clear technical approach and comprehensive evaluation. However, some potential limitations and areas for future research include:

The method is currently limited to generating videos of a single person; extending it to handle multiple interacting characters would be an interesting direction.
The confidence map is generated based on the input pose, but incorporating additional cues (e.g., from the video context) could potentially further improve the guidance.
While the paper demonstrates strong performance on standard benchmarks, real-world deployment may require addressing issues like robustness to varied environments and the ability to handle diverse motion styles.

Conclusion

MimicMotion represents an important advance in the field of human motion video generation, leveraging confidence-aware pose guidance to achieve high-quality and realistic results. The technical innovations and strong empirical performance suggest that this approach could have significant impact on various applications that require lifelike human motion, such as animation, virtual reality, and entertainment. As the field continues to progress, further research exploring the method's broader capabilities and real-world deployment will be valuable.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Do As I Do: Pose Guided Human Motion Copy

Sifan Wu, Zhenguang Liu, Beibei Zhang, Roger Zimmermann, Zhongjie Ba, Xiaosong Zhang, Kui Ren

Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, which unfortunately can be easily perceived by human observers. Motivated by this, we try to tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal consistency. Empirical results on five datasets (iPER, ComplexMotion, SoloDance, Fish, and Mouse) demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID respectively.

6/26/2024

cs.CV

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin

Motion-based controllable text-to-video generation involves motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

7/2/2024

cs.CV

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.

5/29/2024

cs.CV cs.AI

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, Li Yi

Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field utilizing deep learning and motion capture data, existing methods are always limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale and high-quality motion data. At the same time, foundation models trained with internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability for various downstream tasks. Utilizing these foundation models may help with human motion synthesis, which some recent works have superficially explored. However, these methods didn't fully unveil the foundation models' potential for this task and only support several simple actions and environments. In this paper, we for the first time, without any motion data, explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs across any motion task and environment. Our framework can be split into two stages: 1) sequential keyframe generation by utilizing MLLMs as a keyframe designer and animator; 2) motion filling between keyframes through interpolation and motion tracking. Our method can achieve general human motion synthesis for many downstream tasks. The promising results demonstrate the worth of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.

6/24/2024

cs.CV