EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams

2404.08640

Published 4/15/2024 by Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, Vladislav Golyanik

cs.CV

Abstract

Monocular egocentric 3D human motion capture is a challenging and actively researched problem. Existing methods use synchronously operating visual sensors (e.g. RGB cameras) and often fail under low lighting and fast motions, which can be restricting in many applications involving head-mounted devices. In response to the existing limitations, this paper 1) introduces a new problem, i.e., 3D human motion capture from an egocentric monocular event camera with a fisheye lens, and 2) proposes the first approach to it called EventEgo3D (EE3D). Event streams have high temporal resolution and provide reliable cues for 3D human motion capture under high-speed human motions and rapidly changing illumination. The proposed EE3D framework is specifically tailored for learning with event streams in the LNES representation, enabling high 3D reconstruction accuracy. We also design a prototype of a mobile head-mounted device with an event camera and record a real dataset with event observations and the ground-truth 3D human poses (in addition to the synthetic dataset). Our EE3D demonstrates robustness and superior 3D accuracy compared to existing solutions across various challenging experiments while supporting real-time 3D pose update rates of 140Hz.

Overview

This paper presents EventEgo3D (EE3D), a new approach to 3D human motion capture from egocentric event streams.
The method uses the high temporal resolution of a head-mounted event camera with a fisheye lens to estimate the 3D pose of the person wearing the device.
The system pairs deep learning models with the LNES event representation and, according to the abstract, supports real-time 3D pose updates at 140 Hz.

Plain English Explanation

EventEgo3D is a system that tracks the 3D body movements of the person wearing a head-mounted camera, using only what that camera sees. Traditional cameras capture images at a fixed rate, but event cameras are different: they only record changes in the scene, similar to how our eyes work. This gives event cameras very fast response times and low latency, which matters for tracking fast movements and for coping with rapidly changing lighting.

The researchers developed deep learning models that can take the data from an event camera and estimate the 3D position of the person's body parts, like their arms and legs. This allows the system to create a 3D model of the person's movements in real-time, which could be useful for applications like [link to https://aimodels.fyi/papers/arxiv/eventsleep-sleep-activity-recognition-event-cameras]activity recognition[/link] or [link to https://aimodels.fyi/papers/arxiv/multi-person-3d-pose-estimation-from-unlabelled]3D pose estimation[/link] using event cameras.

Technical Explanation

The key innovation of EventEgo3D is its use of event-based representations to enable accurate 3D human pose estimation from egocentric event streams. Event cameras only record changes in pixel intensity, unlike traditional cameras that capture entire frames at a fixed rate. This event-based approach provides several advantages, including high temporal resolution, low latency, and robustness to fast motions.
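To make the idea of "recording only changes in pixel intensity" concrete, here is a minimal sketch of an idealized single-pixel event model. It is not taken from the paper: the function name `events_from_intensity`, the log-intensity formulation, and the `threshold` value are illustrative assumptions based on the common textbook description of event cameras.

```python
import math


def events_from_intensity(samples, threshold=0.2):
    """Emit (timestamp, polarity) events for one idealized pixel.

    samples:   list of (timestamp, intensity) pairs, intensity > 0.
    threshold: hypothetical contrast threshold on log intensity
               (an assumption, not a value from the paper).
    """
    events = []
    _, first_intensity = samples[0]
    ref = math.log(first_intensity)  # log intensity at the last emitted event
    for t, intensity in samples[1:]:
        delta = math.log(intensity) - ref
        # A large brightness change can cross the threshold several times.
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold
            delta = math.log(intensity) - ref
    return events


if __name__ == "__main__":
    static = [(0, 1.0), (1, 1.0), (2, 1.0)]
    brighter = [(0, 1.0), (1, math.e)]
    print(events_from_intensity(static))            # no change, no events
    print(events_from_intensity(brighter, 0.4))     # two positive events
```

A pixel that sees no change produces no output at all, which is why event streams stay sparse in static scenes while still reacting quickly to motion.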

The EventEgo3D pipeline consists of several deep learning models. First, an event-based person detector is used to localize the person in the event stream. Then, a 3D human pose estimation model takes the event data and predicts the 3D positions of the person's body joints. The final output is a 3D skeletal representation of the person's movements.
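The abstract states that EE3D learns from event streams in the LNES representation but does not define it here. The sketch below assumes the commonly described form of a locally-normalised event surface: events inside a time window are written into a two-channel image (one channel per polarity), each pixel storing the normalised timestamp of its most recent event. The function name `lnes`, the event tuple layout, and the window length are illustrative assumptions, not the authors' code.

```python
def lnes(events, width, height, t_start, window):
    """Build a 2-channel LNES-style frame from a window of events.

    events:  iterable of (x, y, t, polarity), polarity in {-1, +1},
             assumed sorted by timestamp t.
    Returns frame[channel][y][x] with values in [0, 1);
    channel 0 holds positive events, channel 1 negative events.
    """
    frame = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        if not (t_start <= t < t_start + window):
            continue  # ignore events outside the current window
        channel = 0 if polarity > 0 else 1
        # Later events overwrite earlier ones at the same pixel and channel.
        frame[channel][y][x] = (t - t_start) / window
    return frame


if __name__ == "__main__":
    # Timestamps in microseconds; a 1000 us window is an arbitrary choice.
    demo_events = [(0, 0, 100, 1), (1, 0, 250, -1), (0, 0, 500, 1), (1, 1, 2000, 1)]
    frame = lnes(demo_events, width=2, height=2, t_start=0, window=1000)
    print(frame[0][0][0], frame[1][0][1], frame[0][1][1])
```

A fixed-size frame like this can be fed to a standard convolutional network, while the stored timestamps keep some of the fine temporal ordering that a plain event count would lose.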

According to the abstract, the researchers use a synthetic dataset together with a real dataset, recorded with their head-mounted prototype, that pairs event observations with ground-truth 3D human poses. They report higher 3D accuracy than existing solutions across various challenging experiments. Related data resources for event cameras and egocentric vision include [link to https://aimodels.fyi/papers/arxiv/eagle-first-event-camera-dataset-gathered-by]EAGLE[/link] and [link to https://aimodels.fyi/papers/arxiv/egogen-egocentric-synthetic-data-generator]EgoGen[/link].

Critical Analysis

The EventEgo3D system represents an important step forward in leveraging event cameras for 3D human motion capture. The use of event-based representations allows the system to achieve high temporal resolution and low latency, which are crucial for accurate 3D pose estimation.

However, the paper does not address some potential limitations of the approach. For example, the system may struggle in scenarios with multiple people in the field of view, as the person detection model would need to reliably segment each individual. Additionally, the reliance on specialized event cameras, which are not yet widely adopted, could limit the practical deployment of the system.

Further research is needed to explore the robustness of EventEgo3D in more complex real-world settings, as well as to investigate ways to make the system more scalable and accessible, such as by leveraging [link to https://aimodels.fyi/papers/arxiv/3d-human-scan-moving-event-camera]event-to-video conversion techniques[/link].

Conclusion

The EventEgo3D system demonstrates the potential of event cameras for 3D human motion capture from an egocentric perspective. By leveraging the unique properties of event-based representations, the researchers have developed a real-time 3D pose estimation system that can track human movements with high accuracy and low latency.

While there are still some challenges to overcome, EventEgo3D represents an important step forward in the field of 3D human pose estimation, with potential applications in areas such as [link to https://aimodels.fyi/papers/arxiv/eagle-first-event-camera-dataset-gathered-by]activity recognition[/link], [link to https://aimodels.fyi/papers/arxiv/multi-person-3d-pose-estimation-from-unlabelled]motion capture[/link], and human-computer interaction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

3D Human Scan With A Moving Event Camera

Kai Kohyama, Shintaro Shiba, Yoshimitsu Aoki

Capturing a 3D human body is one of the important tasks in computer vision with a wide range of applications such as virtual reality and sports analysis. However, conventional frame cameras are limited by their temporal resolution and dynamic range, which imposes constraints in real-world application setups. Event cameras have the advantages of high temporal resolution and high dynamic range (HDR), but the development of event-based methods is necessary to handle data with different characteristics. This paper proposes a novel event-based method for 3D pose estimation and human mesh recovery. Prior work on event-based human mesh recovery requires frames (images) as well as event data. The proposed method relies solely on events; it carves 3D voxels by moving the event camera around a stationary body, reconstructs the human pose and mesh by attenuated rays, and fits statistical body models, preserving high-frequency details. The experimental results show that the proposed method outperforms conventional frame-based methods in the estimation accuracy of both pose and body mesh. We also demonstrate results in challenging situations where a conventional camera has motion blur. This is the first work to demonstrate event-only human mesh recovery, and we hope that it is the first step toward achieving robust and accurate 3D human body scanning from vision sensors. https://florpeng.github.io/event-based-human-scan/

4/17/2024

cs.CV

3D Human Pose Perception from Egocentric Stereo Videos

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

5/16/2024

cs.CV

Instance Tracking in 3D Scenes from Egocentric Videos

Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes

Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task-assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol which evaluates tracking performance in 3D coordinates with two settings for enrolling instances to track: (1) single-view online enrollment, where an instance is specified on-the-fly based on the human wearer's interactions, and (2) multi-view pre-enrollment, where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT), running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. Our experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches in the egocentric setting. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.

6/10/2024

cs.CV

A Survey on 3D Egocentric Human Pose Estimation

Md Mushfiqur Azam, Kevin Desai

Egocentric human pose estimation aims to estimate human body poses and develop body representations from a first-person camera perspective. It has gained vast popularity in recent years because of its wide range of applications in sectors like XR-technologies, human-computer interaction, and fitness tracking. However, to the best of our knowledge, there is no systematic literature review based on the proposed solutions regarding egocentric 3D human pose estimation. To that end, the aim of this survey paper is to provide an extensive overview of the current state of egocentric pose estimation research. In this paper, we categorize and discuss the popular datasets and the different pose estimation models, highlighting the strengths and weaknesses of different methods by comparative analysis. This survey can be a valuable resource for both researchers and practitioners in the field, offering insights into key concepts and cutting-edge solutions in egocentric pose estimation, its wide-ranging applications, as well as the open problems with future scope.

4/19/2024

cs.CV