WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Read original: arXiv:2312.07531 - Published 4/22/2024 by Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black

Overview

Proposes WHAM (World-grounded Humans with Accurate 3D Motion), a method that reconstructs 3D human motion from video in a global (world) coordinate system
Lifts 2D keypoint sequences to 3D using motion capture data, fuses the result with video features, and uses camera angular velocity from a SLAM method to estimate the body's global trajectory
Addresses limitations of prior approaches: estimates confined to camera coordinates, flat-ground assumptions that produce foot sliding, and computationally expensive optimization pipelines

Plain English Explanation

WHAM aims to recover accurate 3D human movement from video and to place that movement in the world, not just relative to the camera. Many earlier methods estimate the body only in camera coordinates, so the person's motion is entangled with the camera's. Methods that do work in world coordinates often assume a flat ground plane and produce feet that slide, and the most accurate ones rely on slow optimization. WHAM instead learns from motion capture data how 2D keypoint sequences correspond to 3D motion, combines this with visual features from the video, and uses an estimate of how the camera is rotating to work out where the person is actually moving. A refinement step that accounts for contact lets it handle cases such as climbing stairs.

Technical Explanation

The key innovation of WHAM is its fusion of motion context with visual information to reconstruct 3D human motion in a global coordinate system. Specifically, WHAM takes a video as input, learns to lift 2D keypoint sequences to 3D using motion capture data, and fuses this with video features. It then combines camera angular velocity, estimated by a SLAM method, with the human motion to estimate the body's global trajectory.
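The paper's abstract says that WHAM estimates the body's global trajectory from camera angular velocity together with human motion. The learned model itself is not reproduced here. The toy sketch below only illustrates the underlying idea of accumulating per-frame, body-frame velocity and heading change into a world-frame path. The function name `integrate_trajectory`, the restriction to planar motion, and the 30 fps frame interval are all assumptions made for illustration.

```python
import math


def integrate_trajectory(local_velocities, yaw_rates, dt=1.0 / 30.0):
    """Accumulate body-frame velocities and yaw rates into a world-frame path.

    local_velocities: list of (forward, left) speeds in m/s, one per frame.
    yaw_rates: list of heading change rates in rad/s, one per frame.
    dt: frame interval in seconds (30 fps is an assumption).
    Returns a list of (x, y, yaw) tuples, starting at the origin.
    """
    x, y, yaw = 0.0, 0.0, 0.0
    path = [(x, y, yaw)]
    for (v_fwd, v_left), yaw_rate in zip(local_velocities, yaw_rates):
        # Rotate the body-frame velocity into the world frame
        # using the current heading.
        vx = v_fwd * math.cos(yaw) - v_left * math.sin(yaw)
        vy = v_fwd * math.sin(yaw) + v_left * math.cos(yaw)
        # Step the position first, then update the heading.
        x += vx * dt
        y += vy * dt
        yaw += yaw_rate * dt
        path.append((x, y, yaw))
    return path


if __name__ == "__main__":
    # One second of walking straight ahead at 1 m/s.
    straight = integrate_trajectory([(1.0, 0.0)] * 30, [0.0] * 30)
    print(straight[-1])
```

Because each step builds on the previous one, any per-frame error accumulates over the sequence, which is one reason a separate refinement stage is useful in this kind of pipeline.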

WHAM's pipeline, as described in the abstract, includes several components:

<a href="https://aimodels.fyi/papers/arxiv/rohm-robust-human-motion-reconstruction-via-diffusion">Motion lifting</a> from 2D keypoint sequences to 3D, learned from motion capture data
Feature fusion that integrates this motion context with visual information from video features
Global trajectory estimation from camera angular velocity (obtained from a SLAM method) together with human motion
Contact-aware trajectory refinement to capture motion in diverse conditions, such as climbing stairs

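By integrating these elements, WHAM reconstructs 3D human motion accurately and efficiently in world coordinates, without the computationally expensive optimization that limits the most accurate prior methods to offline use. The authors report that it outperforms existing 3D human motion recovery methods across multiple in-the-wild benchmarks.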
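The abstract mentions a contact-aware trajectory refinement and names foot sliding as a failure mode of prior work. The sketch below is not WHAM's refinement, which is described in the paper. It is a hand-written heuristic that shows the principle: a foot in contact with the ground should have zero world-frame velocity, so any residual velocity on a contacting foot is treated as sliding and subtracted from the root velocity. The function name `refine_root_velocity`, the 0.5 contact threshold, and the input layout are assumptions.

```python
def refine_root_velocity(root_vel, foot_vels, contact_probs, threshold=0.5):
    """Remove apparent foot sliding from a world-frame root velocity.

    root_vel: (x, y, z) root velocity in m/s.
    foot_vels: list of (x, y, z) world-frame foot velocities in m/s.
    contact_probs: list of contact probabilities in [0, 1], one per foot.
    threshold: probability above which a foot counts as in contact
               (0.5 is an assumed value).
    Returns the corrected root velocity as a tuple.
    """
    # Keep only the feet that are judged to be touching the ground.
    in_contact = [v for v, p in zip(foot_vels, contact_probs) if p > threshold]
    if not in_contact:
        # No contact (e.g. mid-jump): nothing to correct against.
        return tuple(root_vel)
    n = len(in_contact)
    # Average residual velocity of the contacting feet = estimated slide.
    slide = tuple(sum(v[i] for v in in_contact) / n for i in range(3))
    # Shifting the root by -slide brings the contacting feet (on average) to rest.
    return tuple(r - s for r, s in zip(root_vel, slide))


if __name__ == "__main__":
    # Left foot is planted but appears to drift at 0.2 m/s; right foot is swinging.
    refined = refine_root_velocity(
        (1.0, 0.0, 0.0),
        [(0.2, 0.0, 0.0), (1.5, 0.0, 0.0)],
        [0.9, 0.1],
    )
    print(refined)
```

Because the correction is applied to the full 3D velocity and makes no assumption about ground height, the same rule works when the planted foot is on a stair step, which fits the abstract's stair-climbing example.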

Critical Analysis

WHAM depends on upstream estimates: 2D keypoint sequences and camera angular velocity from a SLAM method. Its accuracy may therefore suffer when those inputs are unreliable, for example in highly dynamic scenes or when the person is frequently occluded.

Further research could explore ways to make WHAM more robust to these challenging scenarios, for example by drawing on other work on learning motion from <a href="https://aimodels.fyi/papers/arxiv/learning-human-motion-from-monocular-videos-via">monocular video</a> or on <a href="https://aimodels.fyi/papers/arxiv/robust-human-motion-forecasting-using-transformer-based">forecasting techniques</a> to better handle partial information. Integrating <a href="https://aimodels.fyi/papers/arxiv/3d-human-scan-moving-event-camera">event cameras</a> or other sensors could also help <a href="https://aimodels.fyi/papers/arxiv/improving-robustness-3d-human-pose-estimation-benchmark">improve the robustness</a> of the 3D human reconstruction.

Conclusion

WHAM represents an important advance in 3D human motion reconstruction: it fuses motion context learned from motion capture data with visual features from video, and it recovers motion in a global coordinate system without relying on costly optimization. By grounding human motion in world coordinates and refining the trajectory with contact information, WHAM addresses the foot sliding and flat-ground assumptions of prior approaches. Further research to address the remaining challenges could lead to more robust and versatile motion reconstruction, with applications in areas like animation, robotics, and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black

The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes at http://wham.is.tue.mpg.de/

4/22/2024

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Yufu Wang, Ziyun Wang, Lingjie Liu, Kostas Daniilidis

We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by a large margin from prior work. https://yufu-wang.github.io/tram4d/

9/4/2024

World-Grounded Human Motion Recovery via Gravity-View Coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou

We present a novel method for recovering world-grounded human motion from monocular video. The main challenge lies in the ambiguity of defining the world coordinate system, which varies between sequences. Previous approaches attempt to alleviate this issue by predicting relative motion in an autoregressive manner, but are prone to accumulating errors. Instead, we propose estimating human poses in a novel Gravity-View (GV) coordinate system, which is defined by the world gravity and the camera view direction. The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, largely reducing the ambiguity of learning image-pose mapping. The estimated poses can be transformed back to the world coordinate system using camera rotations, forming a global motion sequence. Additionally, the per-frame estimation avoids error accumulation in the autoregressive methods. Experiments on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both the camera space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed. The code is available at https://zju3dv.github.io/gvhmr/.

9/11/2024

OfCaM: Global Human Mesh Recovery via Optimization-free Camera Motion Scale Calibration

Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Angela Yao

Accurate camera motion estimation is critical to estimate human motion in the global space. A standard and widely used method for estimating camera motion is Simultaneous Localization and Mapping (SLAM). However, SLAM only provides a trajectory up to an unknown scale factor. Different from previous attempts that optimize the scale factor, this paper presents Optimization-free Camera Motion Scale Calibration (OfCaM), a novel framework that utilizes prior knowledge from human mesh recovery (HMR) models to directly calibrate the unknown scale factor. Specifically, OfCaM leverages the absolute depth of human-background contact joints from HMR predictions as a calibration reference, enabling the precise recovery of SLAM camera trajectory scale in global space. With this correctly scaled camera motion and HMR's local motion predictions, we achieve more accurate global human motion estimation. To compensate for scenes where we detect SLAM failure, we adopt a local-to-global motion mapping to fuse with previously derived motion to enhance robustness. Simple yet powerful, our method sets a new standard for global human mesh estimation tasks, reducing global human motion error by 60% over the prior SOTA while also demanding orders of magnitude less inference time compared with optimization-based methods.

7/2/2024