EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

2406.19811

Published 7/1/2024 by Daiwei Zhang, Gengyan Li, Jiajie Li, Mickael Bressieux, Otmar Hilliges, Marc Pollefeys, Luc Van Gool, Xi Wang

cs.CV

Abstract

Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting and segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods in challenging in-the-wild videos and we also qualitatively demonstrate the high quality of the reconstructed models.

Overview

This paper introduces EgoGaussian, a novel approach for dynamic scene understanding from egocentric video using 3D Gaussian splatting.
EgoGaussian aims to reconstruct and track unknown dynamic objects in 3D, while handling occlusions and leveraging the unique perspective of egocentric video.
The method is related to work such as Object-centric Reconstruction and Tracking of Dynamic Unknown Objects using 3D Gaussian Splatting, Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses, and OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering.

Plain English Explanation

EgoGaussian is a new way of understanding dynamic scenes from a first-person (egocentric) video. It can reconstruct and track unknown moving objects in 3D, even when they are partially hidden or obscured.

The key idea is to use 3D Gaussian splatting, which represents objects as 3D "blobs" or shapes that can change over time. This allows the system to handle occlusions and the unique perspective of a camera worn by a person, rather than a stationary camera.

EgoGaussian builds on previous work in areas like object tracking, dynamic 3D reconstruction from monocular video, and rendering occluded humans. By combining these ideas, the researchers created a system for understanding dynamic scenes from an egocentric point of view.

Technical Explanation

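EgoGaussian uses 3D Gaussian splatting to represent and track unknown dynamic objects in egocentric video. The scene is modeled as a set of 3D Gaussians that are segmented into object and background Gaussians. According to the abstract, the object is tracked as a rigid body, so its Gaussians change position and orientation together over time rather than deforming freely.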
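
As a rough illustration of what "a set of 3D Gaussians that moves over time" can mean, the sketch below uses the standard 3D Gaussian Splatting parameterization (covariance built from a rotation quaternion and per-axis scales) and applies a single rigid transform to a group of Gaussians. It is not EgoGaussian's code: the function names `quat_to_rotmat`, `covariance`, and `apply_rigid_motion` are hypothetical, and treating the object as one rigid body is an assumption taken from the abstract's mention of rigid object motion.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Covariance of one Gaussian: Sigma = R S S^T R^T."""
    R = quat_to_rotmat(quat)
    S = np.diag(np.asarray(scale, dtype=float))
    return R @ S @ S.T @ R.T

def apply_rigid_motion(means, covs, R, t):
    """Move N Gaussians by one rigid transform (illustrative assumption).

    means: (N, 3) centers, covs: (N, 3, 3) covariances,
    R: (3, 3) rotation, t: (3,) translation.
    """
    new_means = means @ R.T + t   # x' = R x + t for every center
    new_covs = R @ covs @ R.T     # Sigma' = R Sigma R^T, broadcast over N
    return new_means, new_covs

# One Gaussian elongated along x, then rotated 90 degrees about z and shifted.
means = np.array([[1.0, 0.0, 0.0]])
covs = np.stack([covariance([2.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0])])
R = quat_to_rotmat([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
moved_means, moved_covs = apply_rigid_motion(means, covs, R, np.array([0.0, 0.0, 1.0]))
```

After the transform, the Gaussian's long axis points along y instead of x, while its shape (the eigenvalues of the covariance) is unchanged, which is what distinguishes rigid motion from free-form deformation.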

The system first extracts visual features from the egocentric video and creates an initial 3D representation of the scene, drawing on Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses. It then draws on Object-centric Reconstruction and Tracking of Dynamic Unknown Objects using 3D Gaussian Splatting to identify and track individual objects.

To handle occlusions, EgoGaussian employs techniques from OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering, which allow the 3D Gaussian representations to adapt to partially visible objects.
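
The summary does not describe the occlusion mechanism in detail. As background only, the sketch below shows the generic front-to-back alpha compositing used when rendering Gaussian splats, which is how a nearer splat (for example, a hand) hides what lies behind it at a pixel. The function name `composite_front_to_back` and the sample colors are illustrative assumptions, not part of EgoGaussian or OccGaussian.

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Blend splats covering one pixel, sorted from nearest to farthest.

    colors: (N, 3) RGB per splat, alphas: (N,) opacity per splat at this pixel.
    Returns the pixel color and each splat's blending weight.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching splat i is the product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ colors, weights

red, blue = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
# An opaque red splat in front fully occludes the blue splat behind it.
opaque_pixel, opaque_weights = composite_front_to_back([red, blue], [1.0, 1.0])
# A half-transparent front splat lets half of the blue contribution through.
mixed_pixel, mixed_weights = composite_front_to_back([red, blue], [0.5, 1.0])
```

Because an occluded splat receives a weight near zero, it also receives almost no gradient from that view, which is one reason occluded regions are hard to reconstruct from a single moving camera.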

The system is evaluated on challenging in-the-wild egocentric videos, where the abstract reports that it outperforms previous NeRF and dynamic Gaussian methods; the authors also demonstrate the quality of the reconstructed models qualitatively.
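
The summary does not say which metrics are used. Purely as an assumption, the sketch below computes PSNR, a metric commonly reported when comparing rendered frames against held-out video frames in novel-view synthesis work. The function name `psnr` and the toy images are hypothetical.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a rendering that is off by 0.1 at every pixel of a black image.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
score = psnr(rendered, reference)  # about 20 dB
```

Higher values mean the rendering is closer to the reference frame.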

Critical Analysis

The paper presents a compelling approach to dynamic scene understanding from egocentric video. By leveraging 3D Gaussian splatting, EgoGaussian is able to effectively handle occlusions and the unique perspective of a head-mounted camera.

One potential limitation is the reliance on various existing techniques, which could make the overall system complex and computationally expensive. The authors acknowledge this and suggest exploring ways to streamline the pipeline in future work.

Additionally, the paper does not address how EgoGaussian would perform in more challenging real-world scenarios, such as highly dynamic environments with rapidly moving objects or significant changes in illumination. Further evaluation in diverse settings would help demonstrate the robustness of the approach.

Despite these considerations, EgoGaussian represents an interesting and promising step towards improved dynamic scene understanding from the egocentric viewpoint. The integration of 3D Gaussian splatting with object-centric tracking and reconstruction techniques is a novel and compelling contribution to the field.

Conclusion

EgoGaussian is a novel approach for dynamic scene understanding from egocentric video, using 3D Gaussian splatting to reconstruct and track unknown moving objects. By combining techniques like object-centric tracking, dynamic 3D reconstruction, and handling of occlusions, the system demonstrates state-of-the-art performance in this challenging task.

The integration of these various components into a cohesive framework is a significant contribution, and the paper highlights the potential of leveraging the unique perspective of egocentric video for scene understanding. Further research to optimize the system and evaluate it in more diverse real-world scenarios could lead to even more impactful applications, such as improved robotic perception, augmented reality, and assistive technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Object-centric Reconstruction and Tracking of Dynamic Unknown Objects using 3D Gaussian Splatting

Kuldeep R Barad, Antoine Richard, Jan Dentler, Miguel Olivares-Mendez, Carol Martinez

Generalizable perception is one of the pillars of high-level autonomy in space robotics. Estimating the structure and motion of unknown objects in dynamic environments is fundamental for such autonomous systems. Traditionally, the solutions have relied on prior knowledge of target objects, multiple disparate representations, or low-fidelity outputs unsuitable for robotic operations. This work proposes a novel approach to incrementally reconstruct and track a dynamic unknown object using a unified representation -- a set of 3D Gaussian blobs that describe its geometry and appearance. The differentiable 3D Gaussian Splatting framework is adapted to a dynamic object-centric setting. The input to the pipeline is a sequential set of RGB-D images. 3D reconstruction and 6-DoF pose tracking tasks are tackled using first-order gradient-based optimization. The formulation is simple, requires no pre-training, assumes no prior knowledge of the object or its motion, and is suitable for online applications. The proposed approach is validated on a dataset of 10 unknown spacecraft of diverse geometry and texture under arbitrary relative motion. The experiments demonstrate successful 3D reconstruction and accurate 6-DoF tracking of the target object in proximity operations over a short to medium duration. The causes of tracking drift are discussed and potential solutions are outlined.

5/31/2024

cs.RO

Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

Inhee Lee, Byungjun Kim, Hanbyul Joo

In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, which makes it convenient and efficient to compose and render them together. In particular, we address the scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping the consistency with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene from any novel view at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions for each human. Through various experiments, we demonstrate the quality and efficiency of our method over alternative existing approaches.

4/23/2024

cs.CV

OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering

Jingrui Ye, Zongkai Zhang, Yujiao Jiang, Qingmin Liao, Wenming Yang, Zongqing Lu

Rendering dynamic 3D humans from monocular videos is crucial for various applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, while in real-life scenarios various objects may occlude body parts. A previous method utilizes NeRF for surface rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these issues, we propose OccGaussian based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS with occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space and performs an occlusion feature query at occluded regions, extracting the aggregated pixel-aligned feature to compensate for the missing information. Then we use a Gaussian Feature MLP to further process the feature along with occlusion-aware loss functions to better perceive the occluded area. Extensive experiments on both simulated and real-world occlusions demonstrate that our method achieves comparable or even superior performance compared to the state-of-the-art method, and we improve training and inference speeds by 250x and 800x, respectively. Our code will be available for research purposes.

4/16/2024

cs.CV

Event3DGS: Event-based 3D Gaussian Splatting for Fast Egomotion

Tianyi Xiong, Jiayi Wu, Botao He, Cornelia Fermuller, Yiannis Aloimonos, Heng Huang, Christopher A. Metzler

By combining differentiable rendering with explicit point-based scene representations, 3D Gaussian Splatting (3DGS) has demonstrated breakthrough 3D reconstruction capabilities. However, to date 3DGS has had limited impact on robotics, where high-speed egomotion is pervasive: Egomotion introduces motion blur and leads to artifacts in existing frame-based 3DGS reconstruction methods. To address this challenge, we introduce Event3DGS, an event-based 3DGS framework. By exploiting the exceptional temporal resolution of event cameras, Event3DGS can reconstruct high-fidelity 3D structure and appearance under high-speed egomotion. Extensive experiments on multiple synthetic and real-world datasets demonstrate the superiority of Event3DGS compared with existing event-based dense 3D scene reconstruction frameworks; Event3DGS substantially improves reconstruction quality (+3dB) while reducing computational costs by 95%. Our framework also allows one to incorporate a few motion-blurred frame-based measurements into the reconstruction process to further improve appearance fidelity without loss of structural accuracy.

6/19/2024

cs.CV