World-Grounded Human Motion Recovery via Gravity-View Coordinates

Read original: arXiv:2409.06662 - Published 9/11/2024 by Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou

Overview

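The paper presents a method for recovering world-grounded 3D human motion from monocular (single-camera) video
It introduces a Gravity-View (GV) coordinate system, defined by the world gravity direction and the camera view direction, in which human poses are estimated
Because the GV system is gravity-aligned and uniquely defined for each video frame, it reduces the ambiguity of learning the image-to-pose mapping, and the abstract reports better accuracy and speed than prior state-of-the-art methods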

Plain English Explanation

The researchers have developed a new way to track the 3D movement of people using just a single camera. Previous methods often struggled to accurately represent the person's motion in the real world, but this new approach uses "gravity-view coordinates" to better align the 3D pose with the actual orientation in the physical environment.

By representing the human pose in a way that takes into account the direction of gravity, the system can more precisely recover the full 3D motion of the person being filmed. This could be useful for applications like animation, virtual reality, or even monitoring human movements for healthcare purposes.

Technical Explanation

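The key innovation in this paper is the Gravity-View (GV) coordinate system used to represent 3D human pose. According to the abstract, the main challenge in world-grounded recovery is that the definition of the world coordinate system is ambiguous and varies between sequences. Previous approaches try to work around this by predicting relative motion autoregressively, which makes them prone to accumulating errors.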

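In contrast, the GV coordinate system is defined by the world gravity direction and the camera view direction, so it is gravity-aligned and uniquely defined for each video frame. The method estimates the human pose in this system frame by frame, then transforms the estimates back to the world coordinate system using camera rotations, which yields a global motion sequence. Because each frame is estimated on its own rather than chained to the previous prediction, errors do not accumulate as they can when motion is predicted autoregressively.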
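
As a rough illustration of the idea, the sketch below builds a gravity-aligned frame from a gravity vector and a view direction, both expressed in camera coordinates, and then relates the frames of two video frames through the relative camera rotation. The axis convention (y up, z along the horizontally projected view direction), the function names, and the choice of the first frame as the reference are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def camera_to_gv_rotation(gravity_cam, view_cam=(0.0, 0.0, 1.0)):
    """Return R such that R @ p_cam gives p in a gravity-view (GV) frame.

    Assumed convention (hypothetical, for illustration only):
      y_gv points up, i.e. opposite to gravity;
      z_gv is the view direction projected onto the horizontal plane;
      x_gv = y_gv x z_gv completes a right-handed frame.
    """
    g = np.asarray(gravity_cam, dtype=float)
    v = np.asarray(view_cam, dtype=float)
    y = -g / np.linalg.norm(g)
    z = v - np.dot(v, y) * y          # remove the vertical component of the view
    norm_z = np.linalg.norm(z)
    if norm_z < 1e-8:
        raise ValueError("view direction is parallel to gravity; frame undefined")
    z = z / norm_z
    x = np.cross(y, z)
    return np.stack([x, y, z])        # rows are the GV axes in camera coordinates


def gv_t_to_gv_0(R_c2gv_0, R_c2gv_t, R_ct_to_c0):
    """Rotation taking frame-t GV coordinates to frame-0 GV coordinates.

    R_ct_to_c0 is the relative camera rotation from frame t to frame 0,
    assumed here to come from some external estimator.
    """
    return R_c2gv_0 @ R_ct_to_c0 @ R_c2gv_t.T


# Upright camera with x right, y down, z forward: gravity is +y in camera space.
R_up = camera_to_gv_rotation(gravity_cam=(0.0, 1.0, 0.0))
print(R_up @ np.array([0.0, 1.0, 0.0]))  # gravity maps to the negative y_gv axis
```

Under these assumptions, when gravity is consistent across frames the rotation returned by gv_t_to_gv_0 leaves the y axis unchanged, so the GV frames of two video frames differ only by a rotation about the gravity direction.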

The researchers evaluated their approach on in-the-wild benchmarks and report that it recovers more realistic motion in both camera-space and world-grounded settings, outperforming previous state-of-the-art methods in both accuracy and speed.

Critical Analysis

The paper presents a promising approach for recovering world-grounded 3D human motion from monocular video. Some potential limitations are worth keeping in mind:

The method relies on camera rotations to transform GV-frame poses back to world coordinates, so errors in the estimated camera rotation are likely to carry over into the global motion
The abstract reports results on in-the-wild benchmarks, but how well the method holds up under conditions those benchmarks do not cover is not yet clear
Poses are estimated per frame, which avoids autoregressive error accumulation but could allow temporal inconsistencies in the recovered motion

Future work could address these limitations, for example by incorporating temporal information or developing more robust camera pose estimation techniques. Additionally, further evaluations on more diverse datasets would help assess the method's broader applicability.
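
The trade-off between chaining relative predictions and estimating each frame on its own can be illustrated with a toy simulation. The example below is not the paper's model: it tracks a single yaw angle, and the noise level, frame count, trial count, and function name are assumptions chosen for illustration. One estimator integrates noisy frame-to-frame increments, the other makes an independent noisy estimate per frame.

```python
import numpy as np


def yaw_rms_errors(num_frames=300, noise_std=0.01, num_trials=200, seed=0):
    """RMS yaw error per frame (radians) for two toy estimators."""
    rng = np.random.default_rng(seed)
    true_yaw = np.linspace(0.0, np.pi, num_frames)
    true_delta = np.diff(true_yaw, prepend=true_yaw[0])  # first increment is 0

    # Chained: each frame adds a noisy increment to the previous estimate.
    noisy_delta = true_delta + rng.normal(0.0, noise_std, (num_trials, num_frames))
    yaw_chained = true_yaw[0] + np.cumsum(noisy_delta, axis=1)

    # Per-frame: each frame is estimated independently with the same noise level.
    yaw_per_frame = true_yaw + rng.normal(0.0, noise_std, (num_trials, num_frames))

    def rms(estimate):
        return np.sqrt(np.mean((estimate - true_yaw) ** 2, axis=0))

    return rms(yaw_chained), rms(yaw_per_frame)


chained, per_frame = yaw_rms_errors()
print(f"last-frame RMS error, chained:   {chained[-1]:.3f} rad")
print(f"last-frame RMS error, per-frame: {per_frame[-1]:.3f} rad")
```

In this toy setting the chained error grows roughly with the square root of the number of frames, while the per-frame error stays near the noise level but is uncorrelated from frame to frame, which is the kind of jitter that temporal modelling would be expected to smooth out.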

Conclusion

This paper introduces a novel approach for 3D human motion capture that uses gravity-view coordinates to better align the recovered poses with the real-world environment. By considering the direction of gravity, the system can more accurately reconstruct the full 3D motion of a person from a single camera. This advance could enable improved applications in areas like animation, virtual reality, and healthcare monitoring.

Related Papers

World-Grounded Human Motion Recovery via Gravity-View Coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou

We present a novel method for recovering world-grounded human motion from monocular video. The main challenge lies in the ambiguity of defining the world coordinate system, which varies between sequences. Previous approaches attempt to alleviate this issue by predicting relative motion in an autoregressive manner, but are prone to accumulating errors. Instead, we propose estimating human poses in a novel Gravity-View (GV) coordinate system, which is defined by the world gravity and the camera view direction. The proposed GV system is naturally gravity-aligned and uniquely defined for each video frame, largely reducing the ambiguity of learning image-pose mapping. The estimated poses can be transformed back to the world coordinate system using camera rotations, forming a global motion sequence. Additionally, the per-frame estimation avoids error accumulation in the autoregressive methods. Experiments on in-the-wild benchmarks demonstrate that our method recovers more realistic motion in both the camera space and world-grounded settings, outperforming state-of-the-art methods in both accuracy and speed. The code is available at https://zju3dv.github.io/gvhmr/.

9/11/2024

WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Soyong Shin, Juyong Kim, Eni Halilaj, Michael J. Black

The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes at http://wham.is.tue.mpg.de/

4/22/2024

GazeMotion: Gaze-guided Human Motion Forecasting

Zhiming Hu, Syn Schmitt, Daniel Haeufle, Andreas Bulling

We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods by up to 7.4% improvement in mean per joint position error. Using head direction as a proxy to gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information.

7/12/2024

Free-Moving Object Reconstruction and Pose Estimation with Virtual Camera

Haixin Shi, Yinlin Hu, Daniel Koguciuk, Juan-Ting Lin, Mathieu Salzmann, David Ferstl

We propose an approach for reconstructing free-moving object from a monocular RGB video. Most existing methods either assume scene prior, hand pose prior, object category pose prior, or rely on local optimization with multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and optimizes the sequence globally without any segments. We progressively optimize the object shape and pose simultaneously based on an implicit neural representation. A key aspect of our method is a virtual camera system that reduces the search space of the optimization significantly. We evaluate our method on the standard HO3D dataset and a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach outperforms most methods significantly, and is on par with recent techniques that assume prior information.

5/13/2024