Towards Domain-agnostic Depth Completion

arXiv:2207.14466

Published 4/9/2024 by Guangkai Xu, Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Simon Chen, Jia-Wang Bian

Abstract

Existing depth completion methods are often targeted at a specific sparse depth type and generalize poorly across task domains. We present a method to complete sparse/semi-dense, noisy, and potentially low-resolution depth maps obtained by various range sensors, including those in modern mobile phones, or by multi-view reconstruction algorithms. Our method leverages a data-driven prior in the form of a single image depth prediction network trained on large-scale datasets, the output of which is used as an input to our model. We propose an effective training scheme where we simulate various sparsity patterns in typical task domains. In addition, we design two new benchmarks to evaluate the generalizability and the robustness of depth completion methods. Our simple method shows superior cross-domain generalization ability against state-of-the-art depth completion methods, introducing a practical solution to high-quality depth capture on a mobile device. The code is available at: https://github.com/YvanYin/FillDepth.

Overview

The paper presents a method for completing sparse or semi-dense, noisy, and potentially low-resolution depth maps obtained from various range sensors, including those in modern mobile phones, or from multi-view reconstruction algorithms.
The method leverages a data-driven prior in the form of a single image depth prediction network trained on large-scale datasets, the output of which is used as an input to the model.
The paper introduces a training scheme that simulates the sparsity patterns of typical task domains, and two new benchmarks that evaluate the generalizability and robustness of depth completion methods.

Plain English Explanation

Depth maps, which provide information about the distance of objects from the camera, are essential for many applications, such as augmented reality and 3D scene reconstruction. However, depth maps obtained from various sensors, including those in modern mobile phones, can be sparse, noisy, and low-resolution, limiting their usefulness.

The researchers in this paper developed a method to address these challenges. Their approach uses a neural network that has been trained on large datasets to predict the depth of objects in a single image. This depth prediction is then used as an input to their depth completion model, which fills in the missing or inaccurate parts of the depth map.

The researchers also designed a training scheme that simulates the different sparsity patterns that occur in real-world applications, as well as two new benchmarks to evaluate how well depth completion methods handle these challenging situations. Training on the simulated patterns helps their model generalize across different types of depth data, providing a practical solution for high-quality depth capture on mobile devices.

Technical Explanation

The paper presents a depth completion method that handles sparse or semi-dense, noisy, and low-resolution depth maps obtained from various range sensors, including those in modern mobile phones, or from multi-view reconstruction algorithms. The key idea is to use a data-driven prior: a single-image depth prediction network trained on large-scale datasets, whose output is fed to the depth completion model as an additional input.

The researchers propose an effective training scheme to simulate various sparsity patterns in typical task domains, allowing their model to generalize well across different types of depth data. They also design two new benchmarks to evaluate the cross-domain generalization ability and robustness of depth completion methods.
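To make the idea of simulating sparsity patterns concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function name simulate_sparse_depth, the three pattern names ("random", "grid", "holes"), and all default values are assumptions chosen for illustration. Each pattern roughly mimics one kind of input mentioned above: scattered point samples, a low-resolution sensor, and a semi-dense map with a missing region.

```python
import numpy as np

def simulate_sparse_depth(dense, pattern, rng, keep_ratio=0.01, stride=8, noise_std=0.0):
    """Return (sparse_depth, valid_mask) derived from a dense (H, W) depth map.

    Invalid pixels are set to 0. The patterns and defaults are illustrative
    assumptions, not the settings used in the paper.
    """
    h, w = dense.shape
    mask = np.zeros((h, w), dtype=bool)
    if pattern == "random":
        # Keep a small random fraction of pixels (scattered point samples).
        mask = rng.random((h, w)) < keep_ratio
    elif pattern == "grid":
        # Keep one pixel every `stride` rows and columns (low-resolution sensor).
        mask[::stride, ::stride] = True
    elif pattern == "holes":
        # Semi-dense input: keep everything except one rectangular region.
        mask[:] = True
        y0 = rng.integers(0, h // 2)
        x0 = rng.integers(0, w // 2)
        mask[y0:y0 + h // 2, x0:x0 + w // 2] = False
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    # Pixels without ground truth (depth <= 0) can never be valid samples.
    mask &= dense > 0
    # Optional Gaussian noise; clip so valid samples stay strictly positive.
    noisy = dense + rng.normal(0.0, noise_std, size=dense.shape)
    sparse = np.where(mask, np.clip(noisy, 1e-3, None), 0.0)
    return sparse, mask
```

During training, one could draw a pattern at random for every sample, for example rng.choice(["random", "grid", "holes"]), so that a single model sees all input types.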

The depth completion model takes the sparse/semi-dense, noisy, and potentially low-resolution depth map as input, along with the output of the single image depth prediction network. The model then learns to fill in the missing or inaccurate parts of the depth map, leveraging the information provided by the depth prediction network.
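The input described above can be sketched as a simple channel stack. This is an illustrative assumption rather than the paper's actual interface: the function name build_completion_input, the explicit validity-mask channel, the channel order, and the optional RGB input are all hypothetical.

```python
import numpy as np

def build_completion_input(sparse_depth, mono_depth, rgb=None):
    """Stack completion-model inputs into a (C, H, W) float32 array.

    Channel order (an assumption for this sketch): sparse depth, validity
    mask, single-image depth prediction, then optional RGB channels.
    """
    if sparse_depth.shape != mono_depth.shape:
        raise ValueError("sparse_depth and mono_depth must share the same (H, W) shape")
    # Pixels with depth 0 are treated as missing measurements.
    valid = (sparse_depth > 0).astype(np.float32)
    channels = [sparse_depth.astype(np.float32), valid, mono_depth.astype(np.float32)]
    if rgb is not None:
        # rgb is expected as (H, W, 3); move the color axis to the front.
        channels.extend(np.moveaxis(rgb.astype(np.float32), -1, 0))
    return np.stack(channels, axis=0)
```

Passing the validity mask as its own channel lets a network distinguish "no measurement" from "measured depth near zero"; whether the original model does this is not stated in this summary.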

The researchers report that their simple method outperforms state-of-the-art depth completion methods in cross-domain generalization, offering a practical solution for high-quality depth capture on mobile devices.
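Comparisons between depth completion methods are usually expressed as per-pixel errors against ground-truth depth. This summary does not say which metrics the paper reports, so the sketch below assumes two common choices, root mean squared error (RMSE) and absolute relative error (AbsRel), computed only where ground truth exists. The function name depth_errors is hypothetical.

```python
import numpy as np

def depth_errors(pred, gt):
    """Compute RMSE and AbsRel over pixels with valid ground truth (gt > 0).

    The metric choice is an assumption for illustration, not taken from the paper.
    """
    valid = gt > 0
    if not valid.any():
        raise ValueError("ground truth contains no valid pixels")
    diff = pred[valid] - gt[valid]
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    abs_rel = float(np.mean(np.abs(diff) / gt[valid]))
    return {"rmse": rmse, "abs_rel": abs_rel}
```

Evaluating the same trained model with such metrics on datasets from several sensor types, without retraining, is one way to quantify cross-domain generalization.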

Critical Analysis

The paper presents a promising approach to depth completion that addresses the limitations of existing methods, which are often targeted at specific sparse depth types and generalize poorly across task domains. The use of a data-driven prior in the form of a single image depth prediction network is a clever way to leverage large-scale datasets and improve the performance of the depth completion model.

One potential limitation of the approach is that it relies on the accuracy of the single image depth prediction network. If this network is not sufficiently accurate or generalizable, it could negatively impact the performance of the depth completion model. The researchers do not provide a detailed analysis of the robustness of their approach to variations in the quality of the depth prediction network.

Additionally, the paper introduces two new benchmarks to evaluate the generalizability and robustness of depth completion methods. While this is a valuable contribution, it would be helpful to understand how these benchmarks compare to existing depth completion datasets and challenges, such as the KITTI depth completion benchmark or the NYU Depth V2 dataset. This would provide a more comprehensive perspective on the strengths and weaknesses of the proposed depth completion method.

Conclusion

The paper presents a depth completion method that uses a single-image depth prediction network as a data-driven prior to handle sparse, noisy, and low-resolution depth maps obtained from various range sensors or from multi-view reconstruction algorithms. The researchers introduce a training scheme that simulates different sparsity patterns and two new benchmarks that evaluate the generalizability and robustness of depth completion methods.

The proposed method outperforms state-of-the-art depth completion methods in terms of cross-domain generalization, making it a practical solution for high-quality depth capture on mobile devices. This work contributes to the ongoing efforts to improve the quality and versatility of depth data, which is critical for a wide range of applications, including augmented reality, 3D scene reconstruction, and depth-based computer vision tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

All-day Depth Completion

Vadim Ezhov, Hyoungseob Park, Zhaoyang Zhang, Rishi Upadhyay, Howard Zhang, Chethan Chinder Chandrappa, Achuta Kadambi, Yunhao Ba, Julie Dorsey, Alex Wong

We propose a method for depth estimation under different illumination conditions, i.e., day and night time. As photometry is uninformative in regions under low-illumination, we tackle the problem through a multi-sensor fusion approach, where we take as input an additional synchronized sparse point cloud (i.e., from a LiDAR) projected onto the image plane as a sparse depth map, along with a camera image. The crux of our method lies in the use of the abundantly available synthetic data to first approximate the 3D scene structure by learning a mapping from sparse to (coarse) dense depth maps along with their predictive uncertainty - we term this, SpaDe. In poorly illuminated regions where photometric intensities do not afford the inference of local shape, the coarse approximation of scene depth serves as a prior; the uncertainty map is then used with the image to guide refinement through an uncertainty-driven residual learning (URL) scheme. The resulting depth completion network leverages complementary strengths from both modalities - depth is sparse but insensitive to illumination and in metric scale, and image is dense but sensitive with scale ambiguity. SpaDe can be used in a plug-and-play fashion, which allows for 25% improvement when augmented onto existing methods to preprocess sparse depth. We demonstrate URL on the nuScenes dataset where we improve over all baselines by an average 11.65% in all-day scenarios, 11.23% when tested specifically for daytime, and 13.12% for nighttime scenes.

5/28/2024

cs.CV

Temporal Lidar Depth Completion

Pietari Kaskela, Philipp Fischer, Timo Roman

Given the lidar measurements from an autonomous vehicle, we can project the points and generate a sparse depth image. Depth completion aims at increasing the resolution of such a depth image by infilling and interpolating the sparse depth values. Like most existing approaches, we make use of camera images as guidance in very sparse or occluded regions. In addition, we propose a temporal algorithm that utilizes information from previous timesteps using recurrence. In this work, we show how a state-of-the-art method PENet can be modified to benefit from recurrency. Our algorithm achieves state-of-the-art results on the KITTI depth completion dataset while adding only less than one percent of additional overhead in terms of both neural network parameters and floating point operations. The accuracy is especially improved for faraway objects and regions containing a low amount of lidar depth samples. Even in regions without any ground truth (like sky and rooftops) we observe large improvements which are not captured by the existing evaluation metrics.

6/18/2024

cs.CV cs.AI

A Concise but High-performing Network for Image Guided Depth Completion in Autonomous Driving

Moyun Liu, Bing Chen, Youping Chen, Jingming Xie, Lei Yao, Yang Zhang, Joey Tianyi Zhou

Depth completion is a crucial task in autonomous driving, aiming to convert a sparse depth map into a dense depth prediction. Due to its potentially rich semantic information, RGB image is commonly fused to enhance the completion effect. Image-guided depth completion involves three key challenges: 1) how to effectively fuse the two modalities; 2) how to better recover depth information; and 3) how to achieve real-time prediction for practical autonomous driving. To solve the above problems, we propose a concise but effective network, named CENet, to achieve high-performance depth completion with a simple and elegant structure. Firstly, we use a fast guidance module to fuse the two sensor features, utilizing abundant auxiliary features extracted from the color space. Unlike other commonly used complicated guidance modules, our approach is intuitive and low-cost. In addition, we find and analyze the optimization inconsistency problem for observed and unobserved positions, and a decoupled depth prediction head is proposed to alleviate the issue. The proposed decoupled head can better output the depth of valid and invalid positions with very few extra inference time. Based on the simple structure of dual-encoder and single-decoder, our CENet can achieve superior balance between accuracy and efficiency. In the KITTI depth completion benchmark, our CENet attains competitive performance and inference speed compared with the state-of-the-art methods. To validate the generalization of our method, we also evaluate on indoor NYUv2 dataset, and our CENet still achieve impressive results. The code of this work will be available at https://github.com/lmomoy/CHNet.

4/23/2024

cs.CV

Test-Time Adaptation for Depth Completion

Hyoungseob Park, Anjali Gupta, Alex Wong

It is common to observe performance degradation when transferring models trained on some (source) datasets to target testing data due to a domain gap between them. Existing methods for bridging this gap, such as domain adaptation (DA), may require the source data on which the model was trained (often not available), while others, i.e., source-free DA, require many passes through the testing data. We propose an online test-time adaptation method for depth completion, the task of inferring a dense depth map from a single image and associated sparse depth map, that closes the performance gap in a single pass. We first present a study on how the domain shift in each data modality affects model performance. Based on our observations that the sparse depth modality exhibits a much smaller covariate shift than the image, we design an embedding module trained in the source domain that preserves a mapping from features encoding only sparse depth to those encoding image and sparse depth. During test time, sparse depth features are projected using this map as a proxy for source domain features and are used as guidance to train a set of auxiliary parameters (i.e., adaptation layer) to align image and sparse depth features from the target test domain to that of the source domain. We evaluate our method on indoor and outdoor scenarios and show that it improves over baselines by an average of 21.1%.

5/28/2024

cs.CV cs.LG