EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Read original: arXiv:2308.06493 - Published 9/9/2024 by Jiaxi Jiang, Paul Streli, Manuel Meier, Christian Holz

Overview

Full-body egocentric pose estimation from head and hand poses can power articulate avatar representations on VR/AR headsets.
Existing methods rely too much on the indoor motion-capture datasets they were trained on and assume continuous motion capture and uniform body dimensions.
This paper proposes EgoPoser, a new approach that overcomes these limitations.

Plain English Explanation

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere is a research paper that presents a new method for estimating a person's full-body pose using only the position and orientation of their head and hands. This is an important problem for powering realistic avatar representations in virtual and augmented reality (VR/AR) headsets.

Existing methods for full-body pose estimation often rely heavily on the indoor motion-capture environments where the training data was collected. They also assume that the person's body movements are continuously tracked and that everyone has a similar body shape. EgoPoser was designed to overcome these limitations.

The key ideas behind EgoPoser are:

Robustly model body pose from intermittent hand tracking: EgoPoser can estimate full-body pose even when the hands are only occasionally tracked within the headset's field of view.
Rethink input representations for headset-based pose estimation: EgoPoser uses a novel global motion decomposition method to predict full-body pose independently of the person's overall position.
Enhance pose estimation with an efficient SlowFast module: This module allows EgoPoser to capture longer motion sequences while maintaining high computational efficiency.
Generalize across different body shapes: EgoPoser's model can work well for people with a variety of body sizes and proportions, not just the specific individuals in the training data.
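
To make the first idea concrete, the following is a minimal sketch of how intermittent hand tracking might be simulated: a hand counts as tracked only while it lies inside the headset's field of view, and its features are masked otherwise. The cone-shaped field of view, the 45-degree half-angle, the zero placeholder, and the function names in_field_of_view and mask_hand are all illustrative assumptions, not details taken from the paper.

```python
import math


def in_field_of_view(head_pos, head_forward, hand_pos, half_angle_deg=45.0):
    """Return True if the hand lies inside a cone around the head's forward axis.

    Assumption: the field of view is modeled as a simple cone; a real headset
    has a more complex tracking volume.
    """
    to_hand = [h - p for h, p in zip(hand_pos, head_pos)]
    dist = math.sqrt(sum(c * c for c in to_hand))
    forward_norm = math.sqrt(sum(c * c for c in head_forward))
    if dist == 0.0 or forward_norm == 0.0:
        return False
    dot = sum(a * b for a, b in zip(to_hand, head_forward))
    cos_angle = dot / (dist * forward_norm)
    return cos_angle >= math.cos(math.radians(half_angle_deg))


def mask_hand(hand_features, visible, placeholder=0.0):
    """Keep hand features when tracked; otherwise substitute a placeholder.

    A trailing visibility flag (1.0 or 0.0) tells a downstream model whether
    the values are real observations. Both choices are assumptions.
    """
    if visible:
        return list(hand_features) + [1.0]
    return [placeholder] * len(hand_features) + [0.0]


if __name__ == "__main__":
    head = (0.0, 0.0, 0.0)
    forward = (0.0, 0.0, 1.0)
    for hand in [(0.0, -0.3, 0.5), (0.5, -0.3, -0.2)]:
        visible = in_field_of_view(head, forward, hand)
        print(hand, visible, mask_hand(hand, visible))
```

Applying such masking during training is one plausible way to expose a model to the gaps it would see on a real headset, where hands frequently leave the cameras' view.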

By addressing these limitations of existing methods, EgoPoser establishes a more robust baseline for full-body pose estimation that can scale to large-scale and unseen environments without relying on external motion capture systems.

Technical Explanation

EgoPoser is designed to estimate a person's full-body pose using only the tracked positions and orientations of their head and hands, which are readily available from VR/AR headset sensors.

The key technical innovations in EgoPoser include:

Robust body pose modeling from intermittent hand tracking: EgoPoser can infer the full-body pose even when the hands are only occasionally visible within the headset's field of view, unlike prior methods that require continuous hand tracking.
Novel global motion decomposition input representation: EgoPoser uses a novel input representation that separates the person's global position and orientation from their body articulation. This allows the model to predict full-body pose independently of the user's overall position and orientation.
Efficient SlowFast module for long-range motion modeling: EgoPoser employs a SlowFast module design, which captures both short-term and long-term motion dynamics efficiently. This enables the model to utilize longer temporal information for more accurate pose estimation.
Generalization across body shapes: EgoPoser's architecture is designed to generalize well to people with different body sizes and proportions, going beyond the specific individuals in the training dataset.
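
The idea of predicting pose independently of global position can be illustrated with a small sketch. Here the tracked positions are expressed relative to the head's horizontal location while height is kept absolute, so that walking to a different spot in the world does not change the model input. The choice of z as the up axis, the decision to keep height absolute, and the function name decompose_global_motion are assumptions made for illustration; the paper's exact decomposition may differ.

```python
def decompose_global_motion(head_pos, hand_positions, up_axis=2):
    """Express positions relative to the head's horizontal location.

    The coordinate along `up_axis` is left untouched (height above the floor
    is still informative); the other two coordinates are made head-relative.
    Assumption: positions are (x, y, z) sequences in a shared world frame.
    """
    def localize(point):
        return [
            c if i == up_axis else c - head_pos[i]
            for i, c in enumerate(point)
        ]

    return localize(head_pos), [localize(h) for h in hand_positions]


if __name__ == "__main__":
    head = (10.0, 20.0, 1.7)
    hands = [(10.2, 19.9, 1.1), (9.8, 19.9, 1.0)]
    print(decompose_global_motion(head, hands))

    # The same pose shifted 100 m east and 50 m south yields (up to floating
    # point rounding) the same head-relative representation.
    shifted_head = (110.0, -30.0, 1.7)
    shifted_hands = [(110.2, -30.1, 1.1), (109.8, -30.1, 1.0)]
    print(decompose_global_motion(shifted_head, shifted_hands))
```

With a representation like this, a model trained in a small capture room sees inputs in the same range when the user later moves through a much larger space.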

The authors evaluate EgoPoser experimentally and report that it outperforms state-of-the-art methods both qualitatively and quantitatively while maintaining an inference speed of over 600 frames per second, which corresponds to less than about 1.7 ms of compute per frame. This suggests EgoPoser's potential as a robust and practical solution for full-body pose estimation in VR/AR applications.
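
One way to see how a SlowFast design can cover a longer motion history without a matching increase in compute is to look at the temporal sampling alone. The sketch below splits a window of frames into a sparse "slow" view of the whole window and a dense "fast" view of the most recent frames. The window length of 80, the stride of 2, the fast length of 20, and the function name slowfast_sample are hypothetical values chosen for illustration, not the paper's configuration.

```python
def slowfast_sample(frames, slow_stride=2, fast_len=20):
    """Split a frame window into a sparse long-range view and a dense recent view.

    slow: every `slow_stride`-th frame, aligned so the newest frame is included.
    fast: the last `fast_len` frames at full rate.
    """
    if slow_stride < 1:
        raise ValueError("slow_stride must be at least 1")
    if fast_len < 0:
        raise ValueError("fast_len must be non-negative")
    # Walk backwards from the newest frame, then restore chronological order.
    slow = frames[::-slow_stride][::-1]
    start = len(frames) - min(fast_len, len(frames))
    fast = frames[start:]
    return slow, fast


if __name__ == "__main__":
    window = list(range(80))  # stand-in for 80 frames of head and hand features
    slow, fast = slowfast_sample(window, slow_stride=2, fast_len=20)
    print(len(window), len(slow), len(fast))
```

In this toy configuration the model would process 60 frame tokens instead of 80 while still seeing the full window at reduced rate; how the two views are fused inside the network is outside the scope of this sketch.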

Critical Analysis

The EgoPoser paper presents a well-designed and comprehensive solution for the challenging problem of full-body pose estimation from head and hand tracking alone. Some key strengths of the work include:

Addressing practical limitations of existing methods: EgoPoser specifically targets the real-world constraints of VR/AR applications, such as intermittent hand tracking and diverse body shapes, which prior methods have often overlooked.
Novel technical contributions: The global motion decomposition input representation and efficient SlowFast module are innovative approaches that enable EgoPoser's improved performance.
Thorough experimental evaluation: The authors provide detailed quantitative and qualitative results, benchmarking against state-of-the-art methods and demonstrating EgoPoser's advantages.

However, the paper also acknowledges some limitations and avenues for future work:

Evaluation on more diverse datasets: The experiments were primarily conducted on existing indoor motion capture datasets, which may not fully capture the range of environments and body types encountered in real-world VR/AR applications.
Potential for further performance optimization: While EgoPoser maintains a high inference speed, there may be opportunities to further streamline the architecture or leverage specialized hardware for even faster processing.
Incorporation of additional sensor modalities: The current approach relies only on head and hand tracking, but integrating other egocentric sensors (e.g., inertial measurement units) could potentially enhance the pose estimation accuracy.

Overall, the EgoPoser paper presents a significant advance in the field of full-body egocentric pose estimation, establishing a strong baseline for future research in this area. The authors have successfully addressed several key limitations of prior work, paving the way for more robust and practical pose estimation solutions for VR/AR applications.

Conclusion

EgoPoser is a novel approach for estimating a person's full-body pose using only the tracked positions and orientations of their head and hands. By overcoming the limitations of existing methods, which rely heavily on indoor motion capture data and continuous joint tracking, EgoPoser represents a significant advancement in the field of egocentric pose estimation.

The key innovations in EgoPoser include its robust modeling of body pose from intermittent hand tracking, novel global motion decomposition input representation, efficient SlowFast module for long-range motion modeling, and ability to generalize across different body shapes. Experimental results show that EgoPoser outperforms state-of-the-art methods while maintaining an inference speed of over 600 frames per second, making it a promising solution for powering articulate avatar representations on headset-based VR/AR platforms.

As the use of VR/AR technologies continues to grow, the ability to accurately estimate full-body pose from egocentric sensors will become increasingly important. EgoPoser establishes a strong foundation for future research in this area, paving the way for more robust and practical pose estimation systems that can scale to diverse real-world environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Jiaxi Jiang, Paul Streli, Manuel Meier, Christian Holz

Full-body egocentric pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representations on headset-based platforms. However, existing methods over-rely on the indoor motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous joint motion capture and uniform body dimensions. We propose EgoPoser to overcome these limitations with four main contributions. 1) EgoPoser robustly models body pose from intermittent hand position and orientation tracking only when inside a headset's field of view. 2) We rethink input representations for headset-based ego-pose estimation and introduce a novel global motion decomposition method that predicts full-body pose independent of global positions. 3) We enhance pose estimation by capturing longer motion time series through an efficient SlowFast module design that maintains computational efficiency. 4) EgoPoser generalizes across various body shapes for different users. We experimentally evaluate our method and show that it outperforms state-of-the-art methods both qualitatively and quantitatively while maintaining a high inference speed of over 600fps. EgoPoser establishes a robust baseline for future work where full-body pose estimation no longer needs to rely on outside-in capture and can scale to large-scale and unseen environments.

9/9/2024

3D Human Pose Perception from Egocentric Stereo Videos

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

5/16/2024

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

Chenhongyi Yang, Anastasia Tkach, Shreyas Hampali, Linguang Zhang, Elliot J. Crowley, Cem Keskin

We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. The main challenge in egocentric pose estimation is overcoming joint invisibility, which is caused by self-occlusion or a limited field of view (FOV) of head-mounted cameras. Our approach overcomes this challenge by incorporating a two-stage pose estimation paradigm: in the first stage, our model leverages the global information to estimate each joint's coarse location, then in the second stage, it employs a DETR style transformer to refine the coarse locations by exploiting fine-grained stereo visual features. In addition, we present a Deformable Stereo Attention operation to enable our transformer to effectively process multi-view features, which enables it to accurately localize each joint in the 3D world. We evaluate our method on the stereo UnrealEgo dataset and show it significantly outperforms previous approaches while being computationally efficient: it improves MPJPE by 27.4mm (45% improvement) with only 7.9% model parameters and 13.1% FLOPs compared to the state-of-the-art. Surprisingly, with proper training settings, we find that even our first-stage pose proposal network can achieve superior performance compared to previous arts. We also show that our method can be seamlessly extended to monocular settings, which achieves state-of-the-art performance on the SceneEgo dataset, improving MPJPE by 25.5mm (21% improvement) compared to the best existing method with only 60.7% model parameters and 36.4% FLOPs. Code is available at: https://github.com/ChenhongyiYang/egoposeformer .

8/16/2024

Ultra Inertial Poser: Scalable Motion Capture and Tracking from Sparse Inertial Sensors and Ultra-Wideband Ranging

Rayan Armani, Changlin Qian, Jiaxi Jiang, Christian Holz

While camera-based capture systems remain the gold standard for recording human motion, learning-based tracking systems based on sparse wearable sensors are gaining popularity. Most commonly, they use inertial sensors, whose propensity for drift and jitter has so far limited tracking accuracy. In this paper, we propose Ultra Inertial Poser, a novel 3D full body pose estimation method that constrains drift and jitter in inertial tracking via inter-sensor distances. We estimate these distances across sparse sensor setups using a lightweight embedded tracker that augments inexpensive off-the-shelf 6D inertial measurement units with ultra-wideband radio-based ranging, dynamically and without the need for stationary reference anchors. Our method then fuses these inter-sensor distances with the 3D states estimated from each sensor. Our graph-based machine learning model processes the 3D states and distances to estimate a person's 3D full body pose and translation. To train our model, we synthesize inertial measurements and distance estimates from the motion capture database AMASS. For evaluation, we contribute a novel motion dataset of 10 participants who performed 25 motion types, captured by 6 wearable IMU+UWB trackers and an optical motion capture system, totaling 200 minutes of synchronized sensor data (UIP-DB). Our extensive experiments show state-of-the-art performance for our method over PIP and TIP, reducing position error from 13.62 to 10.65 cm (22% better) and lowering jitter from 1.56 to 0.055 km/s^3 (a reduction of 97%).

5/1/2024