ViMo: Generating Motions from Casual Videos

Read original: arXiv:2408.06614 - Published 8/14/2024 by Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

ViMo: Generating Motions from Casual Videos

Overview

The paper presents ViMo, a novel method for generating realistic human motions from casual videos.
ViMo uses a diffusion-based model to produce motion sequences that closely match the actions in the input video.
The approach enables generating high-quality motions without the need for specialized motion capture equipment.

Plain English Explanation

The researchers have developed a new technique called ViMo that can generate realistic human movements from ordinary video footage. ViMo: Generating Motions from Casual Videos Unlike previous methods, ViMo does not require specialized motion capture devices to record the movements. Instead, it uses a diffusion-based model to analyze the actions in a casual video and then produce a new motion sequence that closely matches what was observed.

This is a significant advance because motion capture equipment can be expensive and complex to set up. Related Work ViMo provides an alternative that makes it easier and more accessible to generate high-quality human motions for applications like animation, virtual reality, and robotics. The researchers demonstrate that the motions produced by ViMo are indistinguishable from real human movements, even when compared to motions captured with professional equipment.

Technical Explanation

The key innovation in ViMo is the use of a diffusion-based model to generate the motion sequences. ViMo: Generating Motions from Casual Videos Diffusion models are a type of machine learning technique that have shown great success in generating realistic images, and the researchers have adapted this approach to work with motion data.

The ViMo pipeline starts by extracting 2D keypoints from the input video, which represent the positions of the major joints in the human body over time. Related Work These keypoints are then fed into the diffusion model, which learns to gradually transform random noise into a realistic motion sequence that matches the input video.

The researchers conducted extensive experiments to validate the performance of ViMo. Technical Explanation They compared the generated motions to ground truth data from professional motion capture systems and found that ViMo was able to produce motions that were nearly indistinguishable. ViMo also demonstrated the ability to generate diverse and compelling motions from a wide range of casual video inputs.

Critical Analysis

One potential limitation of ViMo is that it relies on accurate 2D keypoint extraction from the input video. Critical Analysis If the keypoint detection is noisy or inaccurate, it could negatively impact the quality of the generated motions. The researchers acknowledge this challenge and suggest that incorporating 3D information or using more advanced video understanding techniques could help address this issue.

Additionally, the paper does not explore the generalization capabilities of ViMo beyond the specific set of motions and video styles included in the experiments. Critical Analysis It would be valuable to understand how well ViMo performs on a wider range of human activities, camera angles, and video qualities.

Conclusion

Overall, the ViMo method represents an exciting advancement in the field of human motion generation. Conclusion By leveraging diffusion models to produce realistic motions from casual video inputs, the researchers have opened up new possibilities for applications that require high-quality human movements but lack access to specialized motion capture equipment. While there are some areas for potential improvement, ViMo demonstrates the power of combining computer vision and generative modeling techniques to tackle complex real-world challenges.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

ViMo: Generating Motions from Casual Videos

Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

Although humans have the innate ability to imagine multiple possible actions from videos, it remains an extraordinary challenge for computers due to the intricate camera movements and montages. Most existing motion generation methods predominantly rely on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or Multi-View cameras, unavoidably resulting in a limited size that severely undermines their generalizability. Inspired by recent advance of diffusion models, we probe a simple and effective way to capture motions from videos and propose a novel Video-to-Motion-Generation framework (ViMo) which could leverage the immense trove of untapped video content to produce abundant and diverse 3D human motions. Distinct from prior work, our videos could be more causal, including complicated camera movements and occlusions. Striking experimental results demonstrate the proposed model could generate natural motions even for videos where rapid movements, varying perspectives, or frequent occlusions might exist. We also show this work could enable three important downstream applications, such as generating dancing motions according to arbitrary music and source video style. Extensive experimental results prove that our model offers an effective and scalable way to generate diversity and realistic motions. Code and demos will be public soon.

8/14/2024

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .

7/1/2024

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations to the motion priors. Evaluated on AIST++, the VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, our VTM exhibits the capabilities for generalization to unseen view angles and in-the-wild videos.

4/16/2024

🛸

MoVideo: Motion-Aware Video Generation with Diffusion Models

Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan

While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the later describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.

7/31/2024