3D Human Pose Perception from Egocentric Stereo Videos

arXiv: 2401.00889

Published 5/16/2024 by Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

Abstract

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

Overview

Head-mounted devices provide an egocentric view, which can lead to significant self-occlusions of the user, making it difficult to accurately estimate complex 3D human poses.
The researchers propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation by leveraging scene information and temporal context from the egocentric stereo videos.
The framework utilizes depth features from a 3D scene reconstruction module and human joint queries enhanced by temporal features of the video inputs.
The researchers introduce two new benchmark datasets, UnrealEgo2 and UnrealEgo-RW (RealWorld), to evaluate existing and upcoming methods for egocentric stereo 3D human pose estimation.

Plain English Explanation

Head-mounted devices, like virtual reality headsets, can provide an immersive, first-person view of the world. However, this "egocentric" perspective also makes it challenging to accurately capture the complex 3D movements and poses of the person wearing the device. Because the cameras view the wearer from the head, parts of the body often hide other parts from view (self-occlusion), making it hard to track the person's movements.

To address this issue, the researchers have developed a new AI-powered framework that can better estimate 3D human poses from these egocentric, stereo video inputs. The key innovations are:

Using depth information from a 3D scene reconstruction module to provide more context about the environment the person is moving in.
Incorporating temporal features from the video over time to better understand the person's movements and poses.

The researchers have also created two new benchmark datasets, UnrealEgo2 and UnrealEgo-RW, which provide a much larger and more diverse collection of egocentric stereo video footage for evaluating this type of technology. These new datasets will help drive progress in this area of research.

Overall, this work represents an important step forward in enabling more accurate 3D human pose estimation from the first-person, egocentric perspectives provided by head-mounted devices. This could have applications in areas like virtual reality, augmented reality, and human-computer interaction.

Technical Explanation

The researchers propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation. Their key innovations are:

Utilizing depth features from a 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames. This provides valuable contextual information about the environment the person is moving in.
Enhancing human joint queries with temporal features of the video inputs. This allows the model to better understand the dynamics of the person's movements over time.
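The two ideas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' architecture: the function names (sample_window, decode_joints), the tensor sizes, the random stand-in features, the mean-pooled temporal feature, and the single cross-attention step are all hypothetical choices made for the example. Reading "uniformly sampled windows" as evenly spaced frame indices is also an assumption.

```python
import numpy as np

def sample_window(num_frames: int, window_size: int) -> np.ndarray:
    """Return `window_size` frame indices spread evenly over a clip (assumed sampling scheme)."""
    if not 1 <= window_size <= num_frames:
        raise ValueError("window_size must be between 1 and num_frames")
    return np.linspace(0, num_frames - 1, window_size).round().astype(int)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_joints(joint_queries, temporal_feat, scene_tokens, w_out):
    """One cross-attention step from joint queries to scene tokens (hypothetical helper).

    joint_queries: (J, D) one embedding per body joint
    temporal_feat: (D,)   feature summarising the sampled frame window
    scene_tokens:  (N, D) stand-in for depth/scene features
    w_out:         (D, 3) projection to 3D coordinates
    """
    queries = joint_queries + temporal_feat            # add temporal context to every query
    scores = queries @ scene_tokens.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)                    # (J, N), each row sums to 1
    attended = attn @ scene_tokens                     # (J, D)
    return attended @ w_out, attn                      # (J, 3) joint positions, attention map

rng = np.random.default_rng(0)
J, D, N, T = 16, 32, 64, 5                             # hypothetical sizes

idx = sample_window(num_frames=100, window_size=T)     # frames to use from a 100-frame clip
frame_feats = rng.normal(size=(100, D))                # stand-in for per-frame video features
temporal_feat = frame_feats[idx].mean(axis=0)          # simple mean pooling over the window

joint_queries = rng.normal(size=(J, D))
scene_tokens = rng.normal(size=(N, D))
w_out = rng.normal(size=(D, 3))

joints_3d, attn = decode_joints(joint_queries, temporal_feat, scene_tokens, w_out)
print(idx)
print(joints_3d.shape)
```

In a trained model the queries, projections, and scene features would be learned and the attention would typically be multi-head and stacked in several layers; the sketch only shows how temporal context can be injected into per-joint queries before they attend to scene features.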

The researchers introduce two new benchmark datasets, UnrealEgo2 and UnrealEgo-RW (RealWorld), which offer a much larger number of egocentric stereo views and a wider variety of human motions compared to existing datasets. This enables more comprehensive evaluation of existing and upcoming methods for egocentric 3D human pose estimation.

The researchers' extensive experiments show that their proposed approach significantly outperforms previous methods, including hybrid 3D human pose estimation from monocular video and uncertainty-aware 3D human pose estimation. This demonstrates the effectiveness of their framework in accurately estimating complex human poses, even in challenging scenarios like crouching and sitting.
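This summary does not name the metric behind the comparison. A metric widely used in the 3D pose estimation literature is the mean per-joint position error (MPJPE), the average Euclidean distance between predicted and ground-truth joints. The sketch below assumes that metric purely for illustration; the function name and the toy data are hypothetical and are not results from the paper.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints.

    pred, gt: arrays of shape (frames, joints, 3) in the same units (e.g. millimetres).
    """
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)  # (frames, joints)
    return float(per_joint_error.mean())

# Toy data: 2 frames, 3 joints, every prediction off by 30 units along x.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[..., 0] += 30.0

print(mpjpe(pred, gt))  # 30.0
```

A lower value means the predicted skeleton is closer to the ground truth on average, which is how "outperforms" would be read under this metric.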

Critical Analysis

The researchers acknowledge that their method still has some limitations, such as potential inaccuracies in the 3D scene reconstruction module and the need for further improvements in handling severe occlusions. Additionally, the new benchmark datasets, while more comprehensive than previous ones, may not fully capture the diversity of real-world egocentric scenarios.

One potential concern is the reliance on transformer-based architectures, which can be computationally expensive and require large amounts of training data. It would be interesting to see if the researchers' core ideas could be adapted to more efficient, lightweight models for practical deployment.

Furthermore, the researchers do not provide much insight into the potential ethical implications of their work, such as privacy concerns around the use of egocentric cameras or the potential for misuse of 3D human pose estimation technology. These are important considerations that should be addressed as the field progresses.

Overall, the researchers have made a valuable contribution to the field of egocentric 3D human pose estimation, but there is still room for further advancements, both in terms of technical performance and broader societal considerations.

Conclusion

The proposed transformer-based framework represents a significant step forward in addressing the challenges of 3D human pose estimation from egocentric stereo video. By leveraging scene information and temporal context, the researchers have been able to achieve more accurate pose estimates, even in complex scenarios.

The introduction of the UnrealEgo2 and UnrealEgo-RW benchmark datasets is also a valuable contribution, as it will enable more comprehensive evaluation of existing and upcoming methods in this area. As the field continues to evolve, it will be important to consider not only technical performance but also the broader implications of this technology, particularly in terms of privacy and ethical usage.

Overall, this work represents an important advancement in the field of egocentric 3D human pose estimation, with the potential to enable new applications in virtual reality, augmented reality, and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Survey on 3D Egocentric Human Pose Estimation

Md Mushfiqur Azam, Kevin Desai

Egocentric human pose estimation aims to estimate human body poses and develop body representations from a first-person camera perspective. It has gained vast popularity in recent years because of its wide range of applications in sectors like XR-technologies, human-computer interaction, and fitness tracking. However, to the best of our knowledge, there is no systematic literature review based on the proposed solutions regarding egocentric 3D human pose estimation. To that end, the aim of this survey paper is to provide an extensive overview of the current state of egocentric pose estimation research. In this paper, we categorize and discuss the popular datasets and the different pose estimation models, highlighting the strengths and weaknesses of different methods by comparative analysis. This survey can be a valuable resource for both researchers and practitioners in the field, offering insights into key concepts and cutting-edge solutions in egocentric pose estimation, its wide-ranging applications, as well as the open problems with future scope.

4/19/2024

cs.CV

EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams

Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, Vladislav Golyanik

Monocular egocentric 3D human motion capture is a challenging and actively researched problem. Existing methods use synchronously operating visual sensors (e.g. RGB cameras) and often fail under low lighting and fast motions, which can be restricting in many applications involving head-mounted devices. In response to the existing limitations, this paper 1) introduces a new problem, i.e., 3D human motion capture from an egocentric monocular event camera with a fisheye lens, and 2) proposes the first approach to it called EventEgo3D (EE3D). Event streams have high temporal resolution and provide reliable cues for 3D human motion capture under high-speed human motions and rapidly changing illumination. The proposed EE3D framework is specifically tailored for learning with event streams in the LNES representation, enabling high 3D reconstruction accuracy. We also design a prototype of a mobile head-mounted device with an event camera and record a real dataset with event observations and the ground-truth 3D human poses (in addition to the synthetic dataset). Our EE3D demonstrates robustness and superior 3D accuracy compared to existing solutions across various challenging experiments while supporting real-time 3D pose update rates of 140Hz.

4/15/2024

cs.CV

Instance Tracking in 3D Scenes from Egocentric Videos

Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes

Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task-assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol which evaluates tracking performance in 3D coordinates with two settings for enrolling instances to track: (1) single-view online enrollment where an instance is specified on-the-fly based on the human wearer's interactions, and (2) multi-view pre-enrollment where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT) -- running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. Our experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches in the egocentric setting. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.

6/10/2024

cs.CV

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Wiktor Mucha, Martin Kampel

Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research in understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses in the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective, capturing interactions between hands and objects. Both methods outperform state-of-the-art models on H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture from 2D hand and object poses. This method incorporates EffHandEgoNet, and a transformer-based action recognition method. Evaluated on H2O and FPHA datasets, our architecture has a faster inference time and achieves an accuracy of 91.32% and 94.43%, respectively, surpassing state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach, and how each input affects the overall performance.

4/16/2024

cs.CV