Learning Object Permanence from Videos via Latent Imaginations

Read original: arXiv:2310.10372 - Published 4/12/2024 by Manuel Traub, Frederic Becker, Sebastian Otte, Martin V. Butz

Overview

Examines how deep learning models struggle to recognize objects' continued existence, unlike human infants
Introduces a novel deep learning system called Loci-Looped that learns about object permanence, inertia, and solidity through observation alone
Loci-Looped can track objects through occlusions, anticipate their reappearance, and show signs of surprise when observing implausible object behavior
Outperforms state-of-the-art models in handling object occlusions and temporary sensory interruptions

Plain English Explanation

Human infants show knowledge of object permanence from about two months of age, yet deep learning models still largely fail to recognize that objects continue to exist when they are out of sight. To address this, the researchers developed a new deep learning system called Loci-Looped.

Loci-Looped is a slot-based autoregressive model that learns to combine its own internal "imaginations" about objects with the visual information it observes over time. This novel "loop" allows Loci-Looped to learn fundamental physical concepts like object permanence, inertia, and solidity just by watching videos, without any explicit supervision.

As a result, Loci-Looped is able to track objects through occlusions, anticipate when they will reappear, and even show signs of "surprise" when it observes object behavior that doesn't match its expectations. Compared to other state-of-the-art models, Loci-Looped is better able to handle situations where objects become temporarily occluded or the system's sensory input is interrupted.

The researchers describe Loci-Looped as the first self-supervised, interpretable deep learning model that learns about object permanence directly from video data, without labels for the concepts involved.

Technical Explanation

The key innovation of Loci-Looped is a "looped" architecture that feeds the model's own latent predictions back into its processing of each new frame, which lets it learn about object permanence, inertia, and solidity through observation alone. Rather than relying on labels or other explicit supervision for these concepts, Loci-Looped learns them in a self-supervised manner.

The model operates by maintaining a set of "slot" representations, each corresponding to a distinct object in the scene. Over time, as the model observes a video, it adaptively fuses its own internal "imaginations" about the objects' locations and identities with the pixel-level visual information it receives. This fusion of latent and observed information allows Loci-Looped to build up consistent, object-specific representations of "what" and "where" over the course of the video.
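The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_slots`, the list-based slot encodings, and the hand-set gate values are all assumptions made for illustration, whereas in Loci-Looped the fusion is learned. The sketch only shows the general idea of blending, per slot, an imagined encoding with an observed one.

```python
def fuse_slots(imagined, observed, gates):
    """Blend imagined and observed latent codes, one gate per slot.

    imagined, observed: lists of equal-length float lists (one per slot).
    gates: one float per slot in [0, 1]; 1 trusts the observation fully,
           0 keeps the imagination. The gates are hand-set here
           (an assumption); a learned model would compute them.
    """
    fused = []
    for imag, obs, g in zip(imagined, observed, gates):
        g = min(1.0, max(0.0, g))  # clamp the gate to [0, 1]
        fused.append([g * o + (1.0 - g) * i for i, o in zip(imag, obs)])
    return fused


# Slot 0 is visible, slot 1 is occluded, so its observation is uninformative.
imagined = [[1.0, 2.0], [5.0, 5.0]]
observed = [[1.2, 2.1], [0.0, 0.0]]
gates = [1.0, 0.0]
fused = fuse_slots(imagined, observed, gates)
```

With these gates, the visible slot adopts its observed code while the occluded slot keeps its imagined one, which is the behavior that lets an object's encoding survive an occlusion.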

Crucially, the loop means that these object representations do not just passively track objects: the model anticipates each object's next state and revises its internal encoding when an observation contradicts that prediction. This is what lets Loci-Looped pick up object permanence, directional inertia, and object solidity from video alone.
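As a rough illustration of anticipation and surprise, the toy tracker below follows a single 1D position. It is a sketch under strong assumptions, not the paper's method: the hypothetical `track_with_imagination` function uses a hand-coded constant-velocity rule in place of learned dynamics, marks occluded frames with `None`, and measures surprise as the absolute prediction error on visible frames.

```python
def track_with_imagination(observations):
    """Track a 1D position through occlusions.

    observations: list of floats, with None where the object is occluded.
    The first entry must be visible. Returns (positions, surprises).
    Constant velocity is a hand-coded stand-in (an assumption) for the
    inertia a learned model would have to acquire from data.
    """
    pos = float(observations[0])
    vel = 0.0
    positions, surprises = [pos], [0.0]
    for obs in observations[1:]:
        predicted = pos + vel  # the "imagination" for this step
        if obs is None:
            # Occluded: keep the object alive by trusting the prediction.
            new_pos, surprise = predicted, 0.0
        else:
            # Visible: adopt the observation and record the mismatch.
            new_pos, surprise = float(obs), abs(obs - predicted)
        vel = new_pos - pos
        pos = new_pos
        positions.append(pos)
        surprises.append(surprise)
    return positions, surprises


# Plausible clip: the object reappears where inertia predicts.
plausible_pos, plausible_surprise = track_with_imagination(
    [0.0, 1.0, 2.0, None, None, 5.0, 6.0])

# Implausible clip: the object reappears far from the predicted position.
odd_pos, odd_surprise = track_with_imagination(
    [0.0, 1.0, 2.0, None, None, 9.0, 10.0])
```

In the plausible clip the occluded steps are filled in as 3 and 4, and the reappearance produces no surprise. In the implausible clip the reappearance at 9 conflicts with the predicted 5, so the surprise value spikes at that step.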

In experiments, the researchers show that Loci-Looped outperforms state-of-the-art baseline models on tasks involving object occlusions and temporary sensory interruptions. Additionally, the model's internal activity patterns are more compositional and interpretable compared to other deep learning approaches.

Critical Analysis

The researchers present a compelling approach to endowing deep learning models with a more human-like understanding of object permanence and physical concepts. By using a self-supervised, slot-based architecture that can adaptively fuse latent and observed information, Loci-Looped is able to learn these fundamental properties of the world through observation alone.

However, the paper does not address some potential limitations or areas for further research. For example, it's unclear how well the model would scale to more complex, cluttered scenes with many interacting objects. Additionally, the paper does not explore whether Loci-Looped's learned representations and physical intuitions could be leveraged for other downstream tasks, such as generating realistic training data or commonsense reasoning.

Further research could also investigate the biological plausibility of the Loci-Looped architecture and explore potential connections to how the human brain develops an understanding of object permanence and physical properties. Extending this work to more diverse environments and real-world applications could also be valuable.

Conclusion

The Loci-Looped model represents an important step towards developing deep learning systems that can learn about the physical world in a more human-like way. By introducing a novel self-supervised architecture that can learn about object permanence, inertia, and solidity through observation alone, the researchers have made progress in bridging the gap between artificial and human cognition. While further research is needed to fully explore the implications and potential of this approach, Loci-Looped serves as a promising foundation for future work in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Learning Object Permanence from Videos via Latent Imaginations

Manuel Traub, Frederic Becker, Sebastian Otte, Martin V. Butz

While human infants exhibit knowledge about object permanence from two months of age onwards, deep-learning approaches still largely fail to recognize objects' continued existence. We introduce a slot-based autoregressive deep learning system, the looped location and identity tracking model Loci-Looped, which learns to adaptively fuse latent imaginations with pixel-space observations into consistent latent object-specific what and where encodings over time. The novel loop empowers Loci-Looped to learn the physical concepts of object permanence, directional inertia, and object solidity through observation alone. As a result, Loci-Looped tracks objects through occlusions, anticipates their reappearance, and shows signs of surprise and internal revisions when observing implausible object behavior. Notably, Loci-Looped outperforms state-of-the-art baseline models in handling object occlusions and temporary sensory interruptions while exhibiting more compositional, interpretable internal activity patterns. Our work thus introduces the first self-supervised interpretable learning model that learns about object permanence directly from video data without supervision.

4/12/2024

Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models

Yixuan Huang, Jialin Yuan, Chanho Kim, Pupul Pradhan, Bryan Chen, Li Fuxin, Tucker Hermans

Robots need to have a memory of previously observed, but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks including reasoning with occluded objects, novel objects appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well in terms of different numbers of objects and different numbers of distractor actions. Furthermore, we show our approaches outperform an implicit memory baseline.

5/28/2024

Offline Tracking with Object Permanence

Xianzhong Liu, Holger Caesar

To reduce the expensive labor cost for manual labeling autonomous driving datasets, an alternative is to automatically label the datasets using an offline perception system. However, objects might be temporally occluded. Such occlusion scenarios in the datasets are common yet underexplored in offline auto labeling. In this work, we propose an offline tracking model that focuses on occluded object tracks. It leverages the concept of object permanence which means objects continue to exist even if they are not observed anymore. The model contains three parts: a standard online tracker, a re-identification (Re-ID) module that associates tracklets before and after occlusion, and a track completion module that completes the fragmented tracks. The Re-ID module and the track completion module use the vectorized map as one of the inputs to refine the tracking results with occlusion. The model can effectively recover the occluded object trajectories. It achieves state-of-the-art performance in 3D multi-object tracking by significantly improving the original online tracking result, showing its potential to be applied in offline auto labeling as a useful plugin to improve tracking by recovering occlusions.

5/7/2024

Self-supervised learning of video representations from a child's perspective

A. Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, Brenden M. Lake

Children learn powerful internal models of the world around them from a few years of egocentric visual experience. Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases? Recent advances in collecting large-scale, longitudinal, developmentally realistic video datasets and generic self-supervised learning (SSL) algorithms are allowing us to begin to tackle this nature vs. nurture question. However, existing work typically focuses on image-based SSL algorithms and visual capabilities that can be learned from static images (e.g. object recognition), thus ignoring temporal aspects of the world. To close this gap, here we train self-supervised video models on longitudinal, egocentric headcam recordings collected from a child over a two year period in their early development (6-31 months). The resulting models are highly effective at facilitating the learning of action concepts from a small number of labeled examples; they have favorable data size scaling properties; and they display emergent video interpolation capabilities. Video models also learn more robust object representations than image-based models trained with the exact same data. These results suggest that important temporal aspects of a child's internal model of the world may be learnable from their visual experience using highly generic learning algorithms and without strong inductive biases.

7/26/2024