NeSLAM: Neural Implicit Mapping and Self-Supervised Feature Tracking With Depth Completion and Denoising

2403.20034

Published 4/1/2024 by Tianchen Deng, Yanbo Wang, Hongle Xie, Hesheng Wang, Jingchuan Wang, Danwei Wang, Weidong Chen

Abstract

In recent years, there have been significant advancements in 3D reconstruction and dense RGB-D SLAM systems. One notable development is the application of Neural Radiance Fields (NeRF) in these systems, which utilizes implicit neural representation to encode 3D scenes. This extension of NeRF to SLAM has shown promising results. However, the depth images obtained from consumer-grade RGB-D sensors are often sparse and noisy, which poses significant challenges for 3D reconstruction and affects the accuracy of the representation of the scene geometry. Moreover, the original hierarchical feature grid with occupancy value is inaccurate for scene geometry representation. Furthermore, the existing methods select random pixels for camera tracking, which leads to inaccurate localization and is not robust in real-world indoor environments. To this end, we present NeSLAM, an advanced framework that achieves accurate and dense depth estimation, robust camera tracking, and realistic synthesis of novel views. First, a depth completion and denoising network is designed to provide dense geometry prior and guide the neural implicit representation optimization. Second, the occupancy scene representation is replaced with Signed Distance Field (SDF) hierarchical scene representation for high-quality reconstruction and view synthesis. Furthermore, we also propose a NeRF-based self-supervised feature tracking algorithm for robust real-time tracking. Experiments on various indoor datasets demonstrate the effectiveness and accuracy of the system in reconstruction, tracking quality, and novel view synthesis.

Introduction

The introduction reviews the progress and key properties of visual Simultaneous Localization and Mapping (SLAM) systems. A SLAM system must incrementally build an accurate 3D model of the scene while estimating the camera pose in real time. It must also handle noisy and incomplete sensor data and scale to large environments. The ability to synthesize novel views benefits applications such as virtual reality. Together, these are the key requirements for deploying robust and scalable SLAM systems in the real world.

Figure 1: 3D reconstruction and novel view synthesis results using NeSLAM. The final reconstruction mesh and images of novel view synthesis at different locations showcase the powerful scene reconstruction capability of our algorithm. We provide the PSNR value in the bottom right corner.

Existing visual SLAM systems have difficulty representing the scene effectively. Sparse and dense SLAM systems can perform real-time pose estimation and loop closing, but they fail to capture essential information, resulting in incomplete scene representations.

Recent advancements in deep learning have led to the development of learning-based SLAM systems, such as CodeSLAM and SceneCode, which aim to improve scene representation capabilities. Neural radiance fields (NeRF) is a promising recent technology that can encode scene geometry in fine detail using differentiable rendering and multi-layer perceptrons (MLP).

The paper identifies two key challenges for dense visual SLAM: (1) consumer-grade RGB-D sensors produce sparse and noisy depth images, and (2) existing methods select pixels at random for camera tracking, which limits tracking performance in real-world indoor scenes.

To address these challenges, the proposed NeSLAM system includes the following:

A depth completion and denoising network that generates dense and precise depth images with depth uncertainty information. This geometry prior helps guide the neural point sampling and optimization process.
A NeRF-based self-supervised feature tracking network designed for accurate and real-time camera tracking in complex indoor environments. This network leverages the strengths of NeRF and feature tracking to enhance the system's generalization capability.

The proposed NeSLAM system demonstrates superior performance compared to recent implicit mapping approaches on various indoor RGB-D datasets.

Figure 2: The pipeline of our system. The input is a stream of RGB and depth images, and the output is the implicit scene representation, generated RGB, depth, and depth uncertainty images, and the camera pose. The system runs two parallel threads: the mapping thread and the tracking thread. In the mapping thread, we estimate a dense and accurate depth image along with its depth uncertainty, then use both to guide neural point sampling and implicit representation optimization. While the system operates, the hierarchical feature grids are updated online by minimizing our carefully designed loss through differentiable rendering. In the tracking thread, we propose a NeRF-based self-supervised feature tracking network for accurate and robust pose estimation. This network is optimized online, in a self-supervised manner, by backpropagating a keypoint loss. The two threads are optimized in alternation.

Related Work

Visual SLAM systems construct a map and track the camera's location in real time. Traditional methods such as PTAM rely on keyframes, feature matching, and camera localization against the constructed map. Some sparse mapping methods use keypoints for tracking, mapping, and loop closing, which makes them robust to motion clutter and usable in large indoor environments.

For dense SLAM, DTAM pioneered the use of dense maps and a view-centric scene representation. More recent dense SLAM systems such as BundleFusion and BA-Net employ bundle adjustment for pose estimation. Other methods propose new scene geometry representations, such as latent codes or probabilistic fields, to improve accuracy.

In contrast, this paper uses an implicit scene mapping approach, which allows for more accurate geometry representation and novel view synthesis. This builds on recent work in neural radiance fields (NeRF) for novel view synthesis, but addresses the challenge of poor surface prediction for reconstruction tasks. The paper discusses methods that combine NeRF with 3D geometry representation, such as signed distance fields, to enable more efficient sampling and accurate surface reconstruction without requiring ground-truth camera poses.

The paper also reviews recent work on estimating camera poses from NeRF, such as iNeRF, NeRF--, and BARF. Going further, the paper mentions iMAP and NICE-SLAM, which combine neural implicit mapping with SLAM. The key differences in this paper's approach are the use of a depth completion and denoising network for more accurate reconstruction, and a self-supervised feature tracking method for robust pose estimation.

Method

The method is organized into four parts:

System Overview:

The system uses a three-level hierarchical feature grid and corresponding decoders to represent scene geometry and color.
A depth completion and denoising network is used to estimate dense depth and depth uncertainty, which guide neural point sampling and NeRF optimization.
A self-supervised feature tracking method is used for robust and accurate camera pose estimation.
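
The idea of letting depth and depth uncertainty guide point sampling can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `sample_ray_depths`, the Gaussian spread around the predicted depth, and the sample counts are all assumptions chosen for illustration. It mixes stratified samples over the whole ray with extra samples concentrated near the predicted surface, spread according to the predicted uncertainty.

```python
import random


def sample_ray_depths(depth, sigma, near, far, n_uniform=8, n_near=8, seed=0):
    """Return sorted sample depths along one ray (hypothetical helper).

    depth, sigma: predicted depth and its uncertainty for this pixel.
    near, far:    bounds of the ray segment to sample.
    """
    rng = random.Random(seed)
    step = (far - near) / n_uniform
    # Stratified samples: one random point inside each equal-width bin.
    uniform = [near + (i + rng.random()) * step for i in range(n_uniform)]
    # Surface samples: drawn around the predicted depth, with spread set by
    # the uncertainty (assumed Gaussian here), then clipped to the ray bounds.
    surface = [min(max(rng.gauss(depth, sigma), near), far) for _ in range(n_near)]
    return sorted(uniform + surface)


if __name__ == "__main__":
    print(sample_ray_depths(depth=2.0, sigma=0.05, near=0.1, far=5.0))
```

A confident depth estimate (small `sigma`) packs the surface samples tightly around the predicted depth, while an uncertain one spreads them out, so fewer samples are wasted in empty space.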

Depth Completion and Denoising Network:

The network uses sparse and noisy input depth images and RGB images to predict dense depth, depth uncertainty, and confidence maps.
The network has a two-head encoder-decoder architecture with mirror connections to propagate spatial information.
The network is trained using a negative log-likelihood loss on the depth and uncertainty predictions.
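
To make the negative log-likelihood loss concrete, here is a minimal sketch that assumes a per-pixel Gaussian noise model. The summary does not state which likelihood the paper uses, so the Gaussian form, the function name `gaussian_nll`, and the convention that a target depth of 0 marks a missing measurement are all assumptions.

```python
import math


def gaussian_nll(pred_depth, pred_sigma, target_depth, eps=1e-6):
    """Mean Gaussian negative log-likelihood over pixels with valid targets."""
    total, count = 0.0, 0
    for d, s, t in zip(pred_depth, pred_sigma, target_depth):
        if t <= 0.0:  # assumption: non-positive depth means "no measurement"
            continue
        var = max(s * s, eps)  # clamp the variance for numerical stability
        # -log N(t | d, var) = 0.5 * log(2*pi*var) + (d - t)^2 / (2*var)
        total += 0.5 * math.log(2.0 * math.pi * var) + (d - t) ** 2 / (2.0 * var)
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Same 1 m error, two uncertainty estimates: overconfidence costs more.
    print(gaussian_nll([2.0], [0.1], [1.0]))
    print(gaussian_nll([2.0], [1.0], [1.0]))
```

Under this kind of loss the network is penalized both for depth errors and for misreporting its confidence: a large error paired with a small predicted uncertainty is much more expensive than the same error paired with an honest, larger uncertainty.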

Neural Scene Representation:

The system uses a hierarchical, multi-level grid feature representation with corresponding MLPs for scene geometry.
The geometry is represented using signed distance fields (SDF) to improve representation ability.
An additional feature grid and decoder are used for color representation.
The system uses differentiable rendering to integrate the SDF values and colors for scene representation.
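
The following sketch shows one simple way SDF values sampled along a ray can be turned into rendering weights and then into a rendered depth. The summary does not give the paper's exact conversion, so the bell-shaped weight (a product of two sigmoids that peaks where the SDF crosses zero), the `truncation` parameter, and the function names are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def render_depth_from_sdf(sdf_values, sample_depths, truncation=0.1):
    """Render one ray's depth from SDF samples (hypothetical helper).

    sdf_values:    predicted signed distance at each sample along the ray.
    sample_depths: depth of each sample along the ray.
    """
    # Bell-shaped weight: largest where the SDF is zero, i.e. at the surface.
    raw = [sigmoid(s / truncation) * sigmoid(-s / truncation) for s in sdf_values]
    total = sum(raw)
    weights = [w / total for w in raw]  # normalize so the weights sum to 1
    depth = sum(w * z for w, z in zip(weights, sample_depths))
    return depth, weights


if __name__ == "__main__":
    depths = [1.0, 1.5, 2.0, 2.5, 3.0]
    sdf = [2.0 - z for z in depths]  # flat surface 2 m along the ray
    print(render_depth_from_sdf(sdf, depths))
```

Color can be rendered the same way by replacing `sample_depths` with the per-sample colors from the color decoder. Because every step is differentiable, errors in rendered depth and color can be backpropagated to the feature grids.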

Optimization:

The system jointly optimizes the scene feature grids, color decoder, and camera poses using carefully designed loss functions.
Losses include depth, color, Eikonal, and ICP losses, as well as patch-wise variants for better convergence.
The system incrementally updates the neural networks online during operation.
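
Of the losses listed above, the Eikonal term is the easiest to show in isolation. A valid signed distance field has a gradient of unit length everywhere, and the Eikonal loss penalizes deviations from that. The sketch below is only an illustration: it uses central finite differences on an analytic function, whereas a real system would obtain gradients of the learned field by automatic differentiation, and all function names are hypothetical.

```python
import math


def sdf_gradient(sdf, p, h=1e-4):
    """Central finite-difference gradient of a scalar field at 3D point p."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * h))
    return grad


def eikonal_loss(sdf, points):
    """Mean of (||grad sdf|| - 1)^2 over the given points."""
    total = 0.0
    for p in points:
        norm = math.sqrt(sum(c * c for c in sdf_gradient(sdf, p)))
        total += (norm - 1.0) ** 2
    return total / len(points)


def sphere_sdf(p, radius=1.0):
    """Exact signed distance to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def scaled_field(p):
    """Not a valid SDF: its gradient norm is 2 everywhere away from the origin."""
    return 2.0 * sphere_sdf(p)


if __name__ == "__main__":
    pts = [[0.5, 0.2, -0.3], [1.5, 0.0, 0.4], [-0.7, 0.9, 1.1]]
    print(eikonal_loss(sphere_sdf, pts))    # close to 0
    print(eikonal_loss(scaled_field, pts))  # close to 1
```

The true sphere SDF yields a loss near zero, while the scaled field has the same zero level set but is penalized, which is how this regularizer pushes a learned field toward being a proper distance function.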

In summary, the system uses a novel depth completion network, hierarchical scene representation, and self-supervised feature tracking to enable robust and accurate 3D reconstruction and camera tracking.

Experiments

This section provides implementation details and an evaluation of the proposed method on various datasets. Key points:

The authors used specific hyperparameter settings for the number of sampling points, loss weightings, and optimizers across different datasets.
They evaluated the method on several datasets, including Replica, ScanNet, and TUM RGB-D, to assess scene reconstruction, view synthesis, and camera tracking performance.
On the Replica dataset, the proposed method outperformed prior SLAM systems in accuracy, completeness, and depth estimation.
On ScanNet and the authors' own real-world dataset, the method achieved better camera tracking accuracy compared to prior NeRF-based SLAM systems.
The authors conducted an ablation study to verify the effectiveness of the depth denoising and completion network, hierarchical scene representation with SDF, self-supervised feature tracking, and the loss function design.
The results show these components significantly improve the reconstruction, view synthesis, and camera tracking performance of the proposed approach.

Conclusion

The paper proposes a dense SLAM system called NeSLAM that combines neural implicit scene representation with the SLAM system. It includes a depth denoising and completion network and a self-supervised feature tracking network.

The depth network provides dense depth images with depth uncertainty, which can guide the neural point sampling and enhance scene geometry consistency. The system also incorporates the Signed Distance Field (SDF) value into the hierarchical feature grid to better represent scene geometry.

The proposed NeRF-based self-supervised feature tracking network enables accurate camera tracking and enhances the robustness of the system. Extensive experiments demonstrate the system's effectiveness and accuracy in scene reconstruction, tracking, and view synthesis in complex indoor scenes.

Future work will focus on dynamic scenes, with the aim of achieving high reconstruction and localization accuracy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
