Instance Tracking in 3D Scenes from Egocentric Videos

2312.04117

Published 6/10/2024 by Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes

Abstract

Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task-assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol which evaluates tracking performance in 3D coordinates with two settings for enrolling instances to track: (1) single-view online enrollment where an instance is specified on-the-fly based on the human wearer's interactions, and (2) multi-view pre-enrollment where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT) -- running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. Our experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches in the egocentric setting. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.

Overview

This paper explores the problem of instance tracking in 3D from egocentric (first-person) videos captured by devices like augmented reality (AR) or virtual reality (VR) headsets.
The authors introduce a new benchmark dataset for this task, which includes RGB and depth videos, camera pose information, and instance-level annotations in both 2D and 3D.
They evaluate two settings for enrolling instances to track: single-view online enrollment and multi-view pre-enrollment.
The authors adapt existing methods from relevant areas like single object tracking (SOT) and present a simple proposal-based approach.
Their experiments show the proposal-based method outperforms SOT-based approaches in the egocentric setting.

Plain English Explanation

Devices like AR and VR headsets can capture how a person interacts with objects in their environment. This information could be used to provide helpful task assistance, like remembering where important objects are located. To do this, the system needs to be able to track specific objects (instances) in 3D as the person moves around.

The researchers created a new dataset to study this "egocentric instance tracking in 3D" (IT3DEgo) problem. The dataset includes video, depth information, and annotations marking the location of objects in both 2D (on the video) and 3D (in the real world).

The researchers evaluated two ways the system could learn which objects to track: 1) the user could point out an object in the moment, and the system would track that one, or 2) the system could be pre-loaded with images of objects it should look for.

The researchers tried adapting existing methods for single object tracking and also developed a simpler approach that uses object detection and segmentation models to propose and match objects to track.

Their results show the simpler proposal-based method works better for this egocentric 3D tracking task than the adapted single object tracking methods. The researchers also argue that the task as a whole becomes easier when the system uses the camera's known position and orientation and records object locations in fixed world coordinates rather than relative to the moving camera.

Technical Explanation

The authors introduce a new benchmark dataset for the task of egocentric instance tracking in 3D (IT3DEgo). The dataset includes RGB and depth videos captured from an egocentric perspective, along with per-frame camera pose information and instance-level annotations in both 2D camera and 3D world coordinates.

The authors evaluate two settings for enrolling instances to track: single-view online enrollment where an instance is specified on-the-fly based on the user's interactions, and multi-view pre-enrollment where instance images are stored in memory ahead of time.
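
The difference between the two enrollment settings can be sketched as a small reference store. This is only an illustration: the class `EnrollmentMemory`, its method names, and the use of plain feature vectors are assumptions made here, not structures described in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Feature = Tuple[float, ...]


@dataclass
class EnrollmentMemory:
    """Hypothetical store of reference features for each instance to track."""

    views: Dict[str, List[Feature]] = field(default_factory=dict)

    def pre_enroll(self, instance_id: str, features: Sequence[Sequence[float]]) -> None:
        """Multi-view pre-enrollment: several views are stored before tracking starts."""
        if not features:
            raise ValueError("pre-enrollment needs at least one view")
        self.views[instance_id] = [tuple(f) for f in features]

    def enroll_online(self, instance_id: str, feature: Sequence[float]) -> None:
        """Single-view online enrollment: one view, specified while the video is running."""
        self.views[instance_id] = [tuple(feature)]

    def num_views(self, instance_id: str) -> int:
        """Number of stored reference views (0 if the instance is not enrolled)."""
        return len(self.views.get(instance_id, []))


memory = EnrollmentMemory()
# Stored ahead of time from three viewpoints (toy 2D features).
memory.pre_enroll("mug", [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Specified on the fly from a single frame.
memory.enroll_online("keys", [0.5, 0.5])
```

In this sketch the only difference a tracker sees between the two settings is how many reference views it can compare against, which is why single-view online enrollment is plausibly the harder case.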

To address IT3DEgo, the authors first re-purpose methods from related areas like single object tracking (SOT). This involves running SOT methods to track instances in 2D frames and then lifting the tracks to 3D using the provided camera pose and depth information.
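
The lifting step can be illustrated with a minimal sketch. It assumes a pinhole camera model, metric depth measured along the optical axis, and a 4x4 camera-to-world pose matrix. The function name `lift_pixel_to_world` and all numeric values are made up for illustration, and the dataset's actual conventions may differ.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def lift_pixel_to_world(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> Vec3:
    """Back-project pixel (u, v) with the given depth into world coordinates.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal point
    (cx, cy), and a row-major 4x4 camera-to-world rigid transform.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    # Pixel plus depth -> homogeneous 3D point in the camera frame.
    point_cam = (
        (u - cx) * depth / fx,
        (v - cy) * depth / fy,
        depth,
        1.0,
    )
    # Apply the rigid transform: the first three rows give world x, y, z.
    x, y, z = (
        sum(cam_to_world[row][col] * point_cam[col] for col in range(4))
        for row in range(3)
    )
    return (x, y, z)


# Toy pose: no rotation, camera located at (1, 2, 3) in the world frame.
pose = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Center of a tracked 2D box, 100 pixels right of the principal point, 2 m away.
center_world = lift_pixel_to_world(
    u=420.0, v=240.0, depth=2.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    cam_to_world=pose,
)
```

Because the result is expressed in the world frame, the position of a static object stays the same across frames even as the wearer's head moves.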

The authors also present a simple proposal-based method that leverages pre-trained segmentation and detection models to generate object proposals from RGB frames and match them to the enrolled instance images.
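
A rough sketch of the matching step follows. It assumes each proposal and each enrolled image has already been turned into a feature vector by some pretrained model and compares them by cosine similarity. The summary does not specify the feature extractor or the scoring rule, so the helper names and the `min_score` threshold are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_proposals(
    proposal_feats: Sequence[Sequence[float]],
    enrolled_feats: Dict[str, List[Sequence[float]]],
    min_score: float = 0.5,  # hypothetical acceptance threshold
) -> Dict[str, Optional[int]]:
    """For each enrolled instance, return the index of its best proposal.

    A proposal's score is its highest similarity to any enrolled view of the
    instance. Instances with no proposal scoring above min_score map to None.
    Two instances may select the same proposal; this sketch does not resolve that.
    """
    matches: Dict[str, Optional[int]] = {}
    for instance_id, views in enrolled_feats.items():
        best_idx: Optional[int] = None
        best_score = min_score
        for idx, feat in enumerate(proposal_feats):
            score = max((cosine_similarity(feat, v) for v in views), default=0.0)
            if score > best_score:
                best_idx, best_score = idx, score
        matches[instance_id] = best_idx
    return matches


# Toy 2D features standing in for real embeddings.
proposals = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
enrolled = {
    "mug": [[0.9, 0.1]],                # single enrolled view
    "keys": [[0.1, 0.9], [0.0, 1.0]],   # two enrolled views
    "phone": [[-1.0, 0.0]],             # not visible in this frame
}
matches = match_proposals(proposals, enrolled)
```

Returning `None` for an unmatched instance lets a tracker keep the last known 3D location when the object is out of view rather than forcing a match.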

Experiments show that the authors' proposal-based method (with no fine-tuning) significantly outperforms the SOT-based approaches in the egocentric setting. The authors argue that the egocentric IT3DEgo problem is made easier by the availability of camera pose information and the ability to use a 3D world coordinate representation.
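
Evaluating in 3D world coordinates can be illustrated with a simple error measure. The sketch below averages the Euclidean distance between predicted and annotated 3D positions over frames. This summary does not name the benchmark's metrics, so the choice of mean L2 error and the function name `mean_l2_error` are assumptions made for illustration.

```python
import math
from typing import Sequence


def mean_l2_error(
    predicted: Sequence[Sequence[float]],
    ground_truth: Sequence[Sequence[float]],
) -> float:
    """Average Euclidean distance between per-frame 3D positions."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Two frames: the first prediction is 5 units off, the second is exact.
pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
gt = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error = mean_l2_error(pred, gt)  # (5.0 + 0.0) / 2 = 2.5
```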

Critical Analysis

The paper makes a strong case for the importance of egocentric instance tracking in 3D and the potential benefits it could provide for task assistance applications. The new benchmark dataset is a valuable contribution, as it provides a standardized evaluation framework for this emerging area of research.

One limitation of the work is that the proposed methods are evaluated only on the authors' custom dataset, and it's unclear how well they would generalize to other egocentric 3D scenarios. Further research could explore the performance of these approaches on a wider range of egocentric datasets and tasks.

The authors themselves describe their proposal-based method as simple, and it is run without fine-tuning. More advanced techniques incorporating temporal information, attention mechanisms, or explicit 3D reasoning could potentially lead to further improvements in tracking performance.

Additionally, the paper does not deeply explore the implications of this technology for privacy and ethics. As egocentric devices become more prevalent, there will be important questions to consider around the responsible use of this kind of fine-grained tracking data.

Overall, this paper presents an important step forward in the study of instance tracking from egocentric viewpoints. The new dataset and insights provided by the authors' experiments will likely spur further advancements in this promising area of research.

Conclusion

This paper tackles the problem of instance tracking in 3D from egocentric video, which is an important capability for enabling task assistance applications using AR/VR devices. The authors introduce a new benchmark dataset and evaluate both single-view online enrollment and multi-view pre-enrollment settings for this "egocentric instance tracking in 3D" (IT3DEgo) problem.

Their experiments show that a simple proposal-based method, used without fine-tuning, outperforms adapted single object tracking approaches in the egocentric setting. The authors further argue that camera pose information and a 3D world coordinate representation make the tracking problem easier. This work lays the groundwork for further advancements in egocentric 3D instance tracking, with potential applications in augmented reality, robotics, and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

3D Human Pose Perception from Egocentric Stereo Videos

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

5/16/2024

cs.CV

EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams

Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, Vladislav Golyanik

Monocular egocentric 3D human motion capture is a challenging and actively researched problem. Existing methods use synchronously operating visual sensors (e.g. RGB cameras) and often fail under low lighting and fast motions, which can be restricting in many applications involving head-mounted devices. In response to the existing limitations, this paper 1) introduces a new problem, i.e., 3D human motion capture from an egocentric monocular event camera with a fisheye lens, and 2) proposes the first approach to it called EventEgo3D (EE3D). Event streams have high temporal resolution and provide reliable cues for 3D human motion capture under high-speed human motions and rapidly changing illumination. The proposed EE3D framework is specifically tailored for learning with event streams in the LNES representation, enabling high 3D reconstruction accuracy. We also design a prototype of a mobile head-mounted device with an event camera and record a real dataset with event observations and the ground-truth 3D human poses (in addition to the synthetic dataset). Our EE3D demonstrates robustness and superior 3D accuracy compared to existing solutions across various challenging experiments while supporting real-time 3D pose update rates of 140Hz.

4/15/2024

cs.CV

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Daiwei Zhang, Gengyan Li, Jiajie Li, Mickael Bressieux, Otmar Hilliges, Marc Pollefeys, Luc Van Gool, Xi Wang

Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting and segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods in challenging in-the-wild videos and we also qualitatively demonstrate the high quality of the reconstructed models.

7/1/2024

cs.CV

A Survey on 3D Egocentric Human Pose Estimation

Md Mushfiqur Azam, Kevin Desai

Egocentric human pose estimation aims to estimate human body poses and develop body representations from a first-person camera perspective. It has gained vast popularity in recent years because of its wide range of applications in sectors like XR-technologies, human-computer interaction, and fitness tracking. However, to the best of our knowledge, there is no systematic literature review based on the proposed solutions regarding egocentric 3D human pose estimation. To that end, the aim of this survey paper is to provide an extensive overview of the current state of egocentric pose estimation research. In this paper, we categorize and discuss the popular datasets and the different pose estimation models, highlighting the strengths and weaknesses of different methods by comparative analysis. This survey can be a valuable resource for both researchers and practitioners in the field, offering insights into key concepts and cutting-edge solutions in egocentric pose estimation, its wide-ranging applications, as well as the open problems with future scope.

4/19/2024

cs.CV