Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Read original: arXiv:2404.05072 - Published 4/9/2024 by Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen

Overview

This paper introduces "Lift, Match and Keep" (LMK), a method for tracking objects in 3D from egocentric video.
LMK lifts partial 2D observations to 3D world coordinates, matches them over time to form object tracks, and keeps those tracks when objects are occluded or out of view.
The researchers test LMK on 100 long videos from EPIC-KITCHENS and report that spatial cognition is critical for correctly locating objects over short and long time scales.

Plain English Explanation

The paper describes a new way to help computers understand the 3D spatial layout of a scene, using video from a camera worn by a person (called "egocentric" video). Even when objects in the scene are hidden or go out of view, the LMK method can still track their 3D positions. This is done by "lifting" 2D object detections in each video frame to 3D, "matching" them to a persistent 3D model of the environment, and "keeping" track of objects over time, even when they are briefly occluded.

The key innovations of this work are the ways it combines depth cues from the video to infer 3D structure, and how it reasons about object state changes over time to maintain a coherent spatial understanding, even when objects disappear from view. This allows the system to build a more complete picture of the 3D world compared to previous methods.

Technical Explanation

The LMK method starts from partial 2D observations of objects in each video frame and lifts them to 3D world coordinates. It then matches these 3D observations over time to form object tracks, allowing it to follow the same objects even when they become occluded or leave the camera's field of view.
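To make the idea of placing a 2D observation in 3D concrete, here is a minimal sketch using standard pinhole back-projection. It is an illustration under stated assumptions, not the authors' implementation: it assumes a per-pixel depth value, a camera intrinsics matrix and a camera-to-world pose are already available for the frame, and the function name `lift_to_world` and all numeric values are hypothetical.

```python
import numpy as np

def lift_to_world(pixel_uv, depth, K, cam_to_world):
    """Back-project one pixel with known depth into world coordinates.

    pixel_uv:     (u, v) pixel location, e.g. the centre of an object mask
    depth:        distance along the camera's z-axis at that pixel (assumed given)
    K:            3x3 pinhole intrinsics matrix (assumed given)
    cam_to_world: 4x4 camera-to-world pose for this frame (assumed given)
    """
    u, v = pixel_uv
    # Ray through the pixel in camera coordinates, scaled by the depth.
    point_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Move the point from the camera frame into the fixed world frame.
    point_world = cam_to_world @ np.append(point_cam, 1.0)
    return point_world[:3]

# Hypothetical camera: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]  # camera sits 1 unit along the world x-axis

center = lift_to_world((320.0, 240.0), 2.0, K, pose)
print(center)
```

Because the result is expressed in a fixed world frame rather than the moving camera frame, the stored location stays meaningful after the wearer turns away from the object.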

Key to this is the "lifting" step, which places partial 2D observations from the video in 3D world coordinates. The "matching" step then associates these observations over time using visual appearance, 3D location and interactions, forming object tracks. Finally, the "keeping" step retains these tracks when objects leave the camera's view by reasoning about object permanence - that is, assuming objects continue to exist even when out of sight.
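The matching and keeping steps can be sketched as a small track-update loop. This is a simplified illustration, not the paper's procedure: it uses a greedy assignment with a hypothetical cost that adds 3D distance to appearance dissimilarity, it omits the interaction cue entirely, and the names `Track`, `cost`, `update_tracks` and the `max_cost` threshold are all assumptions made for the example.

```python
import numpy as np

class Track:
    """One tracked object: last known 3D location plus an appearance vector."""
    def __init__(self, track_id, location, feature):
        self.track_id = track_id
        self.location = np.asarray(location, dtype=float)
        self.feature = np.asarray(feature, dtype=float)
        self.visible = True

def cost(track, location, feature, w_dist=1.0, w_app=1.0):
    # Hypothetical cost: 3D distance plus (1 - cosine similarity) of appearance.
    dist = np.linalg.norm(track.location - location)
    cos = feature @ track.feature / (
        np.linalg.norm(feature) * np.linalg.norm(track.feature))
    return w_dist * dist + w_app * (1.0 - cos)

def update_tracks(tracks, observations, max_cost=1.0):
    """Update tracks with (location, feature) observations from one frame."""
    unmatched = set(range(len(tracks)))  # tracks that existed before this frame
    for location, feature in observations:
        location = np.asarray(location, dtype=float)
        feature = np.asarray(feature, dtype=float)
        best = min(unmatched,
                   key=lambda i: cost(tracks[i], location, feature),
                   default=None)
        if best is not None and cost(tracks[best], location, feature) <= max_cost:
            # Match: refresh the existing track with the new observation.
            tracks[best].location = location
            tracks[best].feature = feature
            tracks[best].visible = True
            unmatched.discard(best)
        else:
            # No plausible match: start a new track.
            tracks.append(Track(len(tracks), location, feature))
    # Keep: tracks with no observation stay at their last known location.
    for i in unmatched:
        tracks[i].visible = False
    return tracks

tracks = []
# Frame 1: two objects are observed.
update_tracks(tracks, [((0.0, 0.0, 1.0), (1.0, 0.0)),
                       ((1.0, 0.0, 1.0), (0.0, 1.0))])
# Frame 2: only the first object is observed, slightly moved.
update_tracks(tracks, [((0.1, 0.0, 1.0), (1.0, 0.0))])
for t in tracks:
    print(t.track_id, t.location, t.visible)
```

The point of the design is the last loop: a track that receives no observation is marked as not visible but is never deleted, so its last known 3D location remains available for later queries.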

The researchers evaluate LMK on 100 long videos from EPIC-KITCHENS. For one long egocentric video, for example, they estimate the 3D location of 50 active objects, and 60% of these can be correctly positioned in 3D 2 minutes after leaving the camera view.

Critical Analysis

One limitation of the LMK approach is that it relies on accurately lifting 2D observations to 3D, which can be challenging in complex, cluttered environments. The paper acknowledges this and suggests incorporating additional depth cues or using weaker 3D supervision to address this.

Additionally, the method may struggle with highly dynamic scenes. Extending the approach to handle non-rigid scene elements could be an important area for future research.

Overall, however, the LMK framework represents a promising step forward in using egocentric video to build rich, persistent 3D scene representations, even when objects are temporarily occluded or out of view. This could have important applications in robotic navigation, augmented reality, and other domains that require a deep understanding of the 3D world.

Conclusion

This paper presents a new method called "Lift, Match and Keep" (LMK) that uses egocentric video to build and maintain a 3D spatial model of the environment, even when objects are temporarily occluded or out of sight. By combining 2D object detections with depth cues and a persistent 3D map, LMK can effectively track the 3D positions of objects over time.

The researchers test their approach on 100 long videos from EPIC-KITCHENS, and their results indicate that spatial cognition is critical for correctly locating objects over short and long time scales. While the approach has some limitations, it represents an important step forward in using egocentric video to achieve a richer, more complete understanding of the 3D world around us.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Chiara Plizzari, Shubham Goel, Toby Perrett, Jacob Chalk, Angjoo Kanazawa, Dima Damen

As humans move around, performing their daily tasks, they are able to recall where they have positioned objects in their environment, even if these objects are currently out of sight. In this paper, we aim to mimic this spatial cognition ability. We thus formulate the task of Out of Sight, Not Out of Mind - 3D tracking active objects using observations captured through an egocentric camera. We introduce Lift, Match and Keep (LMK), a method which lifts partial 2D observations to 3D world coordinates, matches them over time using visual appearance, 3D location and interactions to form object tracks, and keeps these object tracks even when they go out-of-view of the camera - hence keeping in mind what is out of sight. We test LMK on 100 long videos from EPIC-KITCHENS. Our results demonstrate that spatial cognition is critical for correctly locating objects over short and long time scales. E.g., for one long egocentric video, we estimate the 3D location of 50 active objects. Of these, 60% can be correctly positioned in 3D after 2 minutes of leaving the camera view.

4/9/2024

Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models

Yixuan Huang, Jialin Yuan, Chanho Kim, Pupul Pradhan, Bryan Chen, Li Fuxin, Tucker Hermans

Robots need to have a memory of previously observed, but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks including reasoning with occluded objects, novel objects appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well in terms of different numbers of objects and different numbers of distractor actions. Furthermore, we show our approaches outperform an implicit memory baseline.

5/28/2024

Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video

Zachary Chavis, Hyun Soo Park, Stephen J. Guy

Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models lack the spatial understanding necessary for robotics applications where the agent must reason about the affordances provided by the 3D world around them. We present a system which trains on spatially-localized egocentric videos in order to connect visual input and task descriptions to predict a task's spatial affordance, that is the location where a person would go to accomplish the task. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our learning-based approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.

7/22/2024

3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Yash Bhalgat, Vadim Tschernezki, Iro Laina, João F. Henriques, Andrea Vedaldi, Andrew Zisserman

Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility. This paper introduces a novel approach to instance segmentation and tracking in first-person video that leverages 3D awareness to overcome these obstacles. Our method integrates scene geometry, 3D object centroid tracking, and instance segmentation to create a robust framework for analyzing dynamic egocentric scenes. By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches. Extensive evaluations on the challenging EPIC Fields dataset demonstrate significant improvements across a range of tracking and segmentation consistency metrics. Specifically, our method outperforms the next best performing approach by 7 points in Association Accuracy (AssA) and 4.5 points in IDF1 score, while reducing the number of ID switches by 73% to 80% across various object categories. Leveraging our tracked instance segmentations, we showcase downstream applications in 3D object reconstruction and amodal video object segmentation in these egocentric settings.

8/20/2024