MonoSelfRecon: Purely Self-Supervised Explicit Generalizable 3D Reconstruction of Indoor Scenes from Monocular RGB Views






Published 4/11/2024 by Runfa Li, Upal Mahbub, Vasudev Bhaskaran, Truong Nguyen


Current monocular 3D scene reconstruction (3DR) works are either fully supervised, not generalizable, or implicit in their 3D representation. We propose a novel framework, MonoSelfRecon, that for the first time achieves explicit 3D mesh reconstruction for generalizable indoor scenes from monocular RGB views through pure self-supervision on voxel-SDF (signed distance function). MonoSelfRecon follows an autoencoder-based architecture and decodes both a voxel-SDF and a generalizable Neural Radiance Field (NeRF), which is used to guide the voxel-SDF in self-supervision. We propose novel self-supervised losses, which not only support pure self-supervision but can also be used together with supervised signals to further boost supervised training. Our experiments show that MonoSelfRecon trained with pure self-supervision outperforms the current best self-supervised indoor depth estimation models and is comparable to 3DR models trained with full supervision using depth annotations. MonoSelfRecon is not restricted to a specific model design and can be applied to any model with voxel-SDF to train it in a purely self-supervised manner.



Overview

  • This paper presents a new method called MonoSelfRecon for 3D reconstruction of indoor scenes from monocular RGB views.
  • The key innovation is that MonoSelfRecon is a purely self-supervised approach, meaning it can learn to reconstruct 3D scenes without depth or 3D annotations.
  • The method is also designed to be generalizable, allowing it to handle a wide variety of indoor scenes rather than being specialized for a particular type of environment.

Plain English Explanation

MonoSelfRecon is a new technique that builds 3D models of indoor spaces from ordinary color images taken with a single camera (monocular RGB views). Unlike most previous methods, it does not need depth or 3D annotations for training: it learns to reconstruct scenes from the images alone.

The key insight behind MonoSelfRecon is that even without knowing the exact size or layout of a room, there are still patterns and relationships that can be learned from many different indoor scenes. By studying many example views, the system can figure out the typical structures, shapes, and visual cues that indicate the 3D geometry of a space. It can then use this learned knowledge to reconstruct the 3D structure of a room it has not seen before from its RGB views.

This self-supervised learning approach is meant to make MonoSelfRecon applicable to many kinds of indoor environments, from homes and offices to schools and stores. According to the abstract, earlier monocular methods are either fully supervised, not generalizable to new scenes, or produce only an implicit 3D representation. MonoSelfRecon aims to avoid all three limitations, which makes it a more broadly applicable tool.

Technical Explanation

MonoSelfRecon tackles 3D reconstruction of indoor scenes from monocular RGB views in a novel way. Rather than relying on paired training data of 2D images and their corresponding 3D ground truth, as many previous methods do, it is trained purely through self-supervision.

According to the abstract, the core of the method is an autoencoder-based architecture that decodes two outputs: a voxel-SDF, a voxel grid that stores signed distances to the nearest surface and from which an explicit mesh can be extracted, and a generalizable Neural Radiance Field (NeRF), which guides the voxel-SDF during self-supervision. No depth or 3D labels are used. Instead, the network is trained on RGB views with a set of novel self-supervised losses, which can also be combined with supervised signals to further improve supervised training.
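To make the abstract's "voxel-SDF" and "explicit" reconstruction more concrete, here is a minimal sketch in plain Python. It is not the paper's code: the sphere stands in for a scene, and the function names, grid resolution, and extent are all assumptions chosen for illustration. It samples a signed distance function on a voxel grid and recovers surface points wherever the sign flips between neighbouring voxels.

```python
import math

def sphere_sdf(x, y, z, radius=1.0):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return math.sqrt(x * x + y * y + z * z) - radius

def build_voxel_sdf(resolution=16, extent=1.5, radius=1.0):
    """Sample the SDF at the centres of a resolution^3 grid spanning [-extent, extent]^3."""
    step = 2.0 * extent / resolution
    coords = [-extent + (i + 0.5) * step for i in range(resolution)]
    grid = [[[sphere_sdf(x, y, z, radius) for z in coords] for y in coords] for x in coords]
    return grid, coords

def surface_crossings(grid, coords):
    """Return surface points found along the x axis where the SDF changes sign."""
    points = []
    n = len(coords)
    for i in range(n - 1):
        for j in range(n):
            for k in range(n):
                a, b = grid[i][j][k], grid[i + 1][j][k]
                if (a < 0.0) != (b < 0.0):
                    t = a / (a - b)  # linear interpolation to the zero level
                    x = coords[i] + t * (coords[i + 1] - coords[i])
                    points.append((x, coords[j], coords[k]))
    return points

# Hypothetical toy scene: a unit sphere on a 16^3 grid.
grid, coords = build_voxel_sdf()
points = surface_crossings(grid, coords)
max_error = max(abs(sphere_sdf(*p)) for p in points)
print(len(points), "surface points; worst distance from the true sphere:", round(max_error, 4))
```

A full mesh extractor such as marching cubes applies the same zero-crossing idea along all three axes and connects the crossings into triangles. That is what distinguishes an explicit representation from an implicit one that has to be queried point by point.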

For example, the network might learn that in a typical indoor scene, surfaces tend to be planar and meet at right angles, or that furniture and walls usually occupy distinct regions. By discovering and leveraging structural priors of this kind, the network can infer the 3D geometry of a new scene from its monocular RGB views.
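The abstract describes a NeRF branch that guides the voxel-SDF during self-supervision. The exact losses are not given in this summary, so the following is only a sketch of the standard NeRF volume-rendering quadrature along a single ray, with a generic squared-error photometric loss against an observed pixel. The sample values, the `render_ray` name, and the loss are illustrative assumptions, not the authors' formulation.

```python
import math

def render_ray(densities, colors, t_values):
    """Standard NeRF quadrature along one ray (needs at least two samples).

    densities: non-negative density per sample
    colors:    (r, g, b) per sample
    t_values:  increasing sample depths along the ray
    Returns (rgb, expected_depth, weights).
    """
    n = len(densities)
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for i in range(n):
        # Spacing to the next sample; the last sample reuses the previous
        # spacing (a simplifying assumption of this sketch).
        if i + 1 < n:
            delta = t_values[i + 1] - t_values[i]
        else:
            delta = t_values[i] - t_values[i - 1]
        alpha = 1.0 - math.exp(-densities[i] * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    rgb = tuple(sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3))
    depth = sum(w * t for w, t in zip(weights, t_values))
    return rgb, depth, weights

# Hypothetical ray: empty space, then a dense red wall starting at sample 15.
t_values = [0.5 + 0.1 * i for i in range(30)]
densities = [50.0 if i >= 15 else 0.0 for i in range(30)]
colors = [(1.0, 0.0, 0.0) if i >= 15 else (0.5, 0.5, 0.5) for i in range(30)]

rgb, depth, weights = render_ray(densities, colors, t_values)

# Generic photometric loss against the pixel actually observed in the image.
observed = (1.0, 0.0, 0.0)
photometric_loss = sum((r - o) ** 2 for r, o in zip(rgb, observed))
print("rendered rgb:", rgb, "expected depth:", depth, "loss:", photometric_loss)
```

The point of the sketch is that the rendered colour can be compared with the real image, so geometry (here the expected depth along the ray) receives a training signal without any depth label.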

Importantly, MonoSelfRecon is designed to be generalizable, meaning it can handle a wide variety of indoor environments rather than a single scene it was fitted to. The abstract adds that the approach is not tied to a specific model design: the self-supervised training can be applied to any model that predicts a voxel-SDF.

According to the abstract, MonoSelfRecon trained with pure self-supervision outperforms the current best self-supervised indoor depth estimation models and is comparable to 3D reconstruction models trained with full supervision from depth annotations.
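Comparisons with depth estimation models are usually reported with a small set of standard metrics. The sketch below computes three widely used ones (absolute relative error, RMSE, and the fraction of pixels within a ratio of 1.25 of ground truth). Whether the paper reports exactly these metrics is an assumption, and the depth values are made up for illustration.

```python
import math

def depth_metrics(pred, gt):
    """Common monocular depth metrics over pixels with valid (positive) depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta1: share of pixels whose prediction is within a factor of 1.25 of ground truth
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

# Hypothetical depths in metres; a ground-truth value of 0.0 marks a missing measurement.
pred = [1.0, 2.2, 3.0, 4.0]
gt = [1.0, 2.0, 4.0, 0.0]
metrics = depth_metrics(pred, gt)
print(metrics)
```

Lower is better for `abs_rel` and `rmse`, and higher is better for `delta1`. Pixels without valid ground truth are skipped rather than counted as errors.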

Critical Analysis

One potential limitation of MonoSelfRecon is that, while it can handle a wide variety of indoor scenes, it may still struggle with highly unusual or atypical environments that deviate significantly from the structural patterns it has learned. The authors acknowledge this in the paper and suggest that incorporating additional self-supervised cues or architectural modifications could help address this limitation.

Additionally, while the self-supervised training approach is a strength of the method, it also means that the quality of the 3D reconstructions is ultimately bounded by the information available in the input RGB images. In cases where the 2D images lack sufficient visual cues or are ambiguous, the network may not be able to infer the true 3D structure with high fidelity.

It would be interesting to see how MonoSelfRecon might be combined with other 3D reconstruction techniques, such as those that leverage additional sensor modalities (e.g., depth cameras) or geometric priors learned from large-scale 3D datasets. Such hybrid approaches could potentially overcome the limitations of purely monocular, self-supervised methods while still retaining their advantages in terms of flexibility and generalization.


Conclusion

MonoSelfRecon represents a significant advancement in 3D reconstruction of indoor scenes from monocular RGB views. By introducing a purely self-supervised approach that can learn to reconstruct 3D scenes in a generalizable way, the authors have opened up new possibilities for applying 3D modeling technology in a wide range of real-world applications, from virtual reality and gaming to architectural design and urban planning.

While the method still has some limitations, the core ideas behind MonoSelfRecon demonstrate the potential of self-supervised learning to enable powerful computer vision capabilities without the need for large, manually-curated datasets. As the field of machine learning continues to evolve, we can expect to see more innovative techniques like this that push the boundaries of what's possible with limited supervision.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Junwen Huang, Alexey Artemov, Yujin Chen, Shuaifeng Zhi, Kai Xu, Matthias Nießner





Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations. The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics with only 2D labeling which can be either manual or machine-generated. Our key technical innovation is to leverage differentiable rendering of color and semantics to bridge 2D observations and unknown 3D space, using the observed RGB images and 2D semantics as supervision, respectively. We additionally develop a learning pipeline and corresponding method to enable learning from imperfect predicted 2D labels, which could be additionally acquired by synthesizing in an augmented set of virtual training views complementing the original real captures, enabling more efficient self-supervision loop for semantics. As a result, our end-to-end trainable solution jointly addresses geometry completion, colorization, and semantic mapping from limited RGB-D images, without relying on any 3D ground-truth information. Our method achieves the state-of-the-art performance of semantic scene completion on two large-scale benchmark datasets MatterPort3D and ScanNet, surpasses baselines even with costly 3D annotations in predicting both geometry and semantics. To our knowledge, our method is also the first 2D-driven method addressing completion and semantic segmentation of real-world 3D scans simultaneously.



Enhancing 2D Representation Learning with a 3D Prior

Mehmet Aygun, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan





Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans who obtain rich 3D information from their binocular vision and through motion, the majority of current self-supervised methods are tasked with learning from monocular 2D image collections. This is noteworthy as it has been demonstrated that shape-centric visual processing is more robust compared to texture-biased automated methods. Inspired by this, we propose a new approach for strengthening existing self-supervised methods by explicitly enforcing a strong 3D structural prior directly into the model during training. Through experiments, across a range of datasets, we demonstrate that our 3D aware representations are more robust compared to conventional self-supervised baselines.



Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding

Yunsong Wang, Na Zhao, Gim Hee Lee





The field of self-supervised 3D representation learning has emerged as a promising solution to alleviate the challenge presented by the scarcity of extensive, well-annotated datasets. However, it continues to be hindered by the lack of diverse, large-scale, real-world 3D scene datasets for source data. To address this shortfall, we propose Generalizable Representation Learning (GRL), where we devise a generative Bayesian network to produce diverse synthetic scenes with real-world patterns, and conduct pre-training with a joint objective. By jointly learning a coarse-to-fine contrastive learning task and an occlusion-aware reconstruction task, the model is primed with transferable, geometry-informed representations. Post pre-training on synthetic data, the acquired knowledge of the model can be seamlessly transferred to two principal downstream tasks associated with 3D scene understanding, namely 3D object detection and 3D semantic segmentation, using real-world benchmark datasets. A thorough series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.



Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Boris Chidlovskii, Leonid Antsfeld





For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists in a generic pretraining to learn 3D geometry, using cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of our proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. For all datasets, our method outperforms state-of-the-art methods, in particular for depth prediction task.

