Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

2404.09499

Published 4/16/2024 by Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

Abstract

Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations to the motion priors. Evaluated on AIST++, the VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, our VTM exhibits the capabilities for generalization to unseen view angles and in-the-wild videos.

Overview

• This paper presents a method for learning human motion from monocular videos using a cross-modal manifold alignment approach.

• The proposed technique aims to recover 3D human poses from 2D video input by aligning the motion manifold of the video data with that of a 3D motion capture dataset.

• This allows the system to infer 3D pose information from 2D video without the multi-view cameras or depth sensors that direct 3D human pose estimation (HPE) methods often require.

Plain English Explanation

The paper describes a way to estimate the 3D movements of a person from a regular 2D video, without needing extra equipment like multiple cameras or depth sensors. The key idea is to align, or match up, the motion patterns in the 2D video data with the 3D motion data from a pre-existing database.

This cross-modal manifold alignment allows the system to infer the 3D pose information from just the 2D video input. So it can figure out the full 3D movements of a person, even though the original video was only 2D.

This is useful because capturing 3D motion directly usually requires specialized equipment such as multiple cameras or depth sensors. The approach in this paper estimates 3D motion from ordinary 2D video instead.

Technical Explanation

The paper introduces the Video-to-Motion Generator (VTM), which recovers 3D human motion from 2D inputs (videos and 2D keypoints) through cross-modal manifold alignment. The key idea is to learn a shared latent space in which features of the 2D inputs line up with features of the 3D motion capture data.
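To make the idea of a shared latent space more concrete, here is a minimal sketch of an alignment objective. It is not the paper's model: the toy linear encoders, the array sizes, the orthographic "drop the depth axis" projection, and the mean-squared alignment loss are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; none of these come from the paper.
T, J, LATENT = 16, 17, 32          # frames, joints, latent width
D_2D, D_3D = J * 2, J * 3          # flattened per-frame feature sizes


def encode(x, w):
    """Toy encoder: one linear layer + tanh, then mean-pool over time."""
    return np.tanh(x @ w).mean(axis=0)


def alignment_loss(z_2d, z_3d):
    """Mean squared distance between paired latent codes."""
    return float(np.mean((z_2d - z_3d) ** 2))


# Paired toy clip: random 3D motion and the 2D keypoints derived from it
# by dropping the depth coordinate (an assumed, simplistic camera model).
motion_3d = rng.normal(size=(T, D_3D))
keypoints_2d = motion_3d.reshape(T, J, 3)[:, :, :2].reshape(T, D_2D)

# Separate (untrained) encoder weights for each modality.
w_2d = rng.normal(scale=0.1, size=(D_2D, LATENT))
w_3d = rng.normal(scale=0.1, size=(D_3D, LATENT))

z_2d = encode(keypoints_2d, w_2d)   # latent code of the 2D input
z_3d = encode(motion_3d, w_3d)      # latent code of the 3D motion
loss = alignment_loss(z_2d, z_3d)   # small when the two codes agree
```

In a real system a term like this would be minimized by gradient descent alongside whatever reconstruction objective the model uses; the paper's actual losses are not described in this summary.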

This is achieved through a neural network architecture that takes 2D video frames as input and outputs the corresponding 3D pose parameters. The network is trained using both 2D video data and a paired 3D motion capture dataset, allowing it to learn the cross-modal mapping between the 2D and 3D motion manifolds.
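The abstract also mentions two data-side choices: modeling the upper and lower body separately, and aligning motion data to a scale-invariant virtual skeleton. The sketch below shows one plausible reading of both steps. The five-joint skeleton, the `UPPER`/`LOWER` joint grouping, and the unit bone length are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical 5-joint skeleton; a real one has many more joints.
# PARENTS[i] is the parent of joint i, and -1 marks the root.
PARENTS = [-1, 0, 1, 0, 3]          # 0 pelvis, 1 spine, 2 head, 3 hip, 4 foot
UPPER = [0, 1, 2]                   # assumed upper-body joints (root included)
LOWER = [0, 3, 4]                   # assumed lower-body joints (root included)
VIRTUAL_BONE_LENGTH = 1.0           # assumed length of every virtual bone


def retarget_to_virtual(joints):
    """Keep each bone's direction but give it the virtual length.

    joints: (J, 3) positions for one frame, with parents listed before
    children. Zero-length bones are not handled in this sketch.
    """
    out = np.zeros_like(joints, dtype=float)
    for j, p in enumerate(PARENTS):
        if p < 0:
            continue  # the root is placed at the origin
        bone = joints[j] - joints[p]
        direction = bone / np.linalg.norm(bone)
        out[j] = out[p] + VIRTUAL_BONE_LENGTH * direction
    return out


def split_body(joints):
    """Return (upper, lower) joint subsets for separate modeling."""
    return joints[UPPER], joints[LOWER]


# Two people in the same pose, one twice as large as the other.
short = np.array(
    [[0, 0, 0], [0, 0.5, 0], [0, 0.8, 0], [0.1, 0, 0], [0.1, -0.9, 0]],
    dtype=float,
)
tall = 2.0 * short

upper_s, lower_s = split_body(retarget_to_virtual(short))
upper_t, lower_t = split_body(retarget_to_virtual(tall))
```

After retargeting, both people map to the same joint positions, so a motion prior learned on this representation would not need to account for differences in body size.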

The proposed method avoids direct 3D human pose estimation, which typically requires multi-view cameras or depth sensors. Instead, it uses abundant 2D video data and infers 3D pose through the learned cross-modal alignment.

Critical Analysis

The paper acknowledges several limitations of the proposed approach. First, it relies on the availability of a 3D motion capture dataset paired with the 2D video data, which may not always be readily accessible.

Additionally, the cross-modal alignment assumes that the motion manifolds of the 2D and 3D data can be brought into close correspondence, which may not hold in all cases, particularly for complex or unconventional motion patterns. Further research may be needed to make the manifold alignment more robust.

Another potential issue is the sensitivity of the method to occlusions or missing data in the 2D video input, which could degrade the accuracy of the 3D pose estimation. Exploring ways to make the system more robust to such challenges could be an area for future work.

Conclusion

This paper presents a novel approach for learning 3D human motion from monocular 2D video inputs, using a cross-modal manifold alignment technique. The method allows 3D pose estimation without the specialized multi-view or depth-sensing hardware that direct 3D human pose estimation often requires.

The proposed system could support a range of applications where 3D human motion data is crucial but not readily available, from motion-driven animation and video customization to human-robot interaction and motion forecasting. Further research that addresses the identified limitations and improves the robustness of the approach could broaden these applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

Qing Yu, Mikihiro Tanaka, Kent Fujiwara

To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce motion patches, a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.

5/9/2024

cs.CV

Multimodal Sense-Informed Prediction of 3D Human Motions

Zhenyu Lou, Qiongjie Cui, Haofan Wang, Xu Tang, Hong Zhou

Predicting future human pose is a fundamental application for machine intelligence, which drives robots to plan their behavior and paths ahead of time to seamlessly accomplish human-robot collaboration in real-world 3D scenarios. Despite encouraging results, existing approaches rarely consider the effects of the external scene on the motion sequence, leading to pronounced artifacts and physical implausibilities in the predictions. To address this limitation, this work introduces a novel multi-modal sense-informed motion prediction approach, which conditions high-fidelity generation on two modal information: external 3D scene, and internal human gaze, and is able to recognize their salience for future human activity. Furthermore, the gaze information is regarded as the human intention, and combined with both motion and scene features, we construct a ternary intention-aware attention to supervise the generation to match where the human wants to reach. Meanwhile, we introduce semantic coherence-aware attention to explicitly distinguish the salient point clouds and the underlying ones, to ensure a reasonable interaction of the generated sequence with the 3D scene. On two real-world benchmarks, the proposed method achieves state-of-the-art performance both in 3D human pose and trajectory prediction.

5/7/2024

cs.CV

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Zichen Geng, Caren Han, Zeeshan Hayder, Jian Liu, Mubarak Shah, Ajmal Mian

Text-driven human motion generation is an emerging task in animation and humanoid robot design. Existing algorithms directly generate the full sequence which is computationally expensive and prone to errors as it does not pay special attention to key poses, a process that has been the cornerstone of animation for decades. We propose KeyMotion, which generates plausible human motion sequences corresponding to input text by first generating keyframes followed by in-filling. We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space to reduce dimensionality and further accelerate the subsequent diffusion process. For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the keyframe latents and text condition. To complete the motion sequence, we propose a text-guided Transformer designed to perform motion-in-filling, ensuring the preservation of both fidelity and adherence to the physical constraints of human motion. Experiments show that our method achieves state-of-the-art results on the HumanML3D dataset outperforming others on all R-precision metrics and MultiModal Distance. KeyMotion also achieves competitive performance on the KIT dataset, achieving the best results on Top3 R-precision, FID, and Diversity metrics.

5/27/2024

cs.CV cs.AI

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang

This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose the MoVid-Bench, with carefully manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in the caption, spatial-temporal comprehension, and reasoning ability.

5/31/2024

cs.CV