Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation

2212.05315

Published 4/4/2024 by Lior Talker, Aviad Cohen, Erez Yosef, Alexandra Dana, Michael Dinerstein

Abstract

Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications. Recently, LIDAR-supervised methods have achieved remarkable per-pixel depth accuracy in outdoor scenes. However, significant errors are typically found in the proximity of depth discontinuities, i.e., depth edges, which often hinder the performance of depth-dependent applications that are sensitive to such inaccuracies, e.g., novel view synthesis and augmented reality. Since direct supervision for the location of depth edges is typically unavailable in sparse LIDAR-based scenes, encouraging the MDE model to produce correct depth edges is not straightforward. To the best of our knowledge this paper is the first attempt to address the depth edges issue for LIDAR-supervised scenes. In this work we propose to learn to detect the location of depth edges from densely-supervised synthetic data, and use it to generate supervision for the depth edges in the MDE training. To quantitatively evaluate our approach, and due to the lack of depth edges GT in LIDAR-based scenes, we manually annotated subsets of the KITTI and the DDAD datasets with depth edges ground truth. We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets. Code and datasets are available at https://github.com/liortalker/MindTheEdge.

Overview

Monocular Depth Estimation (MDE) is an important problem in computer vision with many applications
Recent methods using LIDAR data have achieved high accuracy, but struggle with depth edges
Depth edges are crucial for applications like novel view synthesis and augmented reality
This paper proposes a new approach to improve depth edge accuracy in LIDAR-supervised MDE models

Plain English Explanation

Monocular Depth Estimation (MDE) is the task of estimating the depth of every pixel in a scene from a single 2D image. This is an important problem in computer vision that enables applications like virtual reality, autonomous driving, and 3D modeling. Recent MDE methods that use LIDAR (laser-based sensor) data to supervise the training process have achieved very accurate depth predictions overall.

However, these LIDAR-supervised models tend to struggle with correctly estimating the depth at the edges of objects, where the depth changes suddenly. These "depth edges" are crucial for applications that require precise 3D information, like generating novel views of a scene or overlaying digital content on the real world in augmented reality. Inaccuracies at depth edges can significantly degrade the performance of these applications.

The challenge is that LIDAR data, which provides the depth supervision, is sparse and does not directly reveal the location of depth edges. The authors of this paper propose a new approach to address this issue. They use densely-annotated synthetic data to first train a model to detect the location of depth edges. They then use this depth edge detection model to provide additional supervision during the training of the main MDE model, encouraging it to better estimate the depth at object boundaries.
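To make the role of dense synthetic data concrete, here is a minimal sketch of how depth edge labels could be derived from a dense depth map. This summary does not describe how the authors build their labels, so the relative-jump rule, the function name depth_edges_from_dense_depth, and the rel_threshold value are all illustrative assumptions. The point is only that a dense depth map makes such labels cheap to compute, whereas sparse LIDAR returns do not.

```python
import numpy as np

def depth_edges_from_dense_depth(depth, rel_threshold=0.1):
    """Mark pixels next to a sharp relative depth jump.

    Hypothetical labelling rule (an assumption, not the paper's):
    two neighbouring pixels form a depth edge when their depth
    difference exceeds rel_threshold times the nearer depth.
    """
    depth = np.asarray(depth, dtype=np.float64)
    edges = np.zeros(depth.shape, dtype=bool)

    # Jumps between horizontal neighbours.
    dx = np.abs(depth[:, 1:] - depth[:, :-1])
    near_x = np.minimum(depth[:, 1:], depth[:, :-1])
    jump_x = dx > rel_threshold * near_x
    edges[:, 1:] |= jump_x
    edges[:, :-1] |= jump_x

    # Jumps between vertical neighbours.
    dy = np.abs(depth[1:, :] - depth[:-1, :])
    near_y = np.minimum(depth[1:, :], depth[:-1, :])
    jump_y = dy > rel_threshold * near_y
    edges[1:, :] |= jump_y
    edges[:-1, :] |= jump_y

    return edges

# Toy "synthetic" scene: a near object (5 m) in front of a far wall (20 m).
depth = np.full((4, 6), 20.0)
depth[:, :3] = 5.0
edges = depth_edges_from_dense_depth(depth)
# Columns 2 and 3 (either side of the jump) are marked in every row.
print(edges.astype(int))
```

With sparse LIDAR, most neighbouring pixels have no depth value at all, so a rule like this cannot be applied directly. That is why the edge detector is trained on dense synthetic data instead.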

Technical Explanation

The key elements of the proposed approach are:

Depth Edge Detection Model: The authors train a neural network model to detect the location of depth edges in images, using densely-annotated synthetic data as supervision.
Depth Edge Supervision: During training of the main MDE model, the authors use the depth edge detection model to generate pseudo-ground truth depth edge labels for the LIDAR-based training data. This provides additional supervision to help the MDE model accurately estimate depth at object boundaries.
Evaluation: To quantitatively assess the depth edge accuracy, the authors manually annotated subsets of the KITTI and DDAD datasets with depth edge ground truth, as this information is not readily available in LIDAR-based datasets.
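The depth edge supervision described above can be sketched as a training objective with two terms: a depth loss on the sparse LIDAR pixels and an edge loss against the pseudo-ground truth edges. The summary does not give the authors' loss, so everything below is an assumption made for illustration: the L1 depth term, the convention that a LIDAR value of 0 means "no return", the gradient-based edge strength, the binary cross-entropy edge term, and the edge_weight value. It is written in NumPy for readability and is not a differentiable training implementation.

```python
import numpy as np

def sparse_depth_loss(pred_depth, lidar_depth):
    """Mean absolute error over pixels with a LIDAR return.

    Assumes missing LIDAR pixels are stored as 0.
    """
    valid = lidar_depth > 0
    if not valid.any():
        return 0.0
    return float(np.abs(pred_depth[valid] - lidar_depth[valid]).mean())

def predicted_edge_strength(pred_depth):
    """Soft edge map in [0, 1) from the relative depth gradient."""
    gy, gx = np.gradient(pred_depth)
    rel_mag = np.hypot(gx, gy) / np.maximum(pred_depth, 1e-6)
    return rel_mag / (1.0 + rel_mag)

def edge_loss(pred_depth, pseudo_edges):
    """Binary cross-entropy between edge strength and pseudo edge labels."""
    p = np.clip(predicted_edge_strength(pred_depth), 1e-6, 1.0 - 1e-6)
    t = pseudo_edges.astype(np.float64)
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())

def total_loss(pred_depth, lidar_depth, pseudo_edges, edge_weight=0.1):
    """Sparse depth term plus a weighted edge term (hypothetical weighting)."""
    return (sparse_depth_loss(pred_depth, lidar_depth)
            + edge_weight * edge_loss(pred_depth, pseudo_edges))

# Toy example: pseudo edges sit at a depth step between columns 2 and 3.
pseudo_edges = np.zeros((4, 6), dtype=bool)
pseudo_edges[:, 2:4] = True
lidar = np.zeros((4, 6))            # only two pixels have LIDAR returns
lidar[1, 1], lidar[2, 4] = 5.0, 18.0

sharp = np.full((4, 6), 20.0)       # prediction with a crisp step
sharp[:, :3] = 5.0
flat = np.full((4, 6), 10.0)        # prediction with no step at all

print(edge_loss(sharp, pseudo_edges), edge_loss(flat, pseudo_edges))
```

In this toy case the crisp prediction gets a much lower edge loss than the flat one, even though only two pixels carry LIDAR depth. This is the kind of extra signal the pseudo-ground truth edges are meant to provide.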

The authors demonstrate that incorporating the depth edge supervision leads to significant improvements in depth edge accuracy, while maintaining comparable overall depth estimation performance on several challenging datasets.
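One common way to score depth edge accuracy against annotated edges is precision and recall with a small pixel tolerance. The sketch below illustrates that idea. The summary does not state which metric the authors use, so the tolerance-based matching, the square neighbourhood, and the function names are assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def dilate(mask, radius):
    """Grow a boolean mask by `radius` pixels (square neighbourhood)."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def edge_precision_recall(pred_edges, gt_edges, tolerance=1):
    """Precision and recall of predicted edges within a pixel tolerance."""
    pred_edges = np.asarray(pred_edges, dtype=bool)
    gt_edges = np.asarray(gt_edges, dtype=bool)

    # A predicted edge pixel counts as correct if a GT edge is nearby.
    matched_pred = (pred_edges & dilate(gt_edges, tolerance)).sum()
    # A GT edge pixel counts as found if a predicted edge is nearby.
    matched_gt = (gt_edges & dilate(pred_edges, tolerance)).sum()

    precision = matched_pred / max(pred_edges.sum(), 1)
    recall = matched_gt / max(gt_edges.sum(), 1)
    return float(precision), float(recall)

# Annotated edge in column 3; predicted edge is one pixel off, in column 4.
gt = np.zeros((5, 7), dtype=bool)
gt[:, 3] = True
pred = np.zeros((5, 7), dtype=bool)
pred[:, 4] = True

print(edge_precision_recall(pred, gt, tolerance=1))  # one-pixel slack
print(edge_precision_recall(pred, gt, tolerance=0))  # exact match only
```

The tolerance matters because hand-drawn annotations are rarely pixel-exact. With a tolerance of one pixel the prediction above counts as fully correct, while exact matching would score it as a complete miss.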

Critical Analysis

The proposed approach represents an important advance in addressing a key limitation of LIDAR-supervised MDE models. By leveraging additional depth edge supervision, the authors are able to significantly improve the accuracy of depth predictions near object boundaries, which is crucial for many downstream applications.

However, the reliance on manually annotated depth edge ground truth for evaluation is a limitation, as this is a time-consuming and subjective process. The authors acknowledge this and suggest that developing automatic methods for depth edge annotation would be an important area for future work.

Additionally, the authors only evaluate their approach on outdoor driving datasets. It would be valuable to see how the method performs on a wider range of scenes and applications, such as indoor environments.

Conclusion

This paper presents a novel approach to improve the accuracy of depth edges in LIDAR-supervised monocular depth estimation models. By leveraging depth edge detection supervision from synthetic data, the authors are able to significantly enhance the performance of MDE models near object boundaries, a critical capability for applications like novel view synthesis and augmented reality. While the reliance on manual annotations is a limitation, this work represents an important step forward in addressing a key challenge in monocular depth estimation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

Zhiyuan Cheng, Cheng Han, James Liang, Qifan Wang, Xiangyu Zhang, Dongfang Liu

Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we introduce a novel self-supervised adversarial training approach for MDE models, leveraging view synthesis without the need for ground-truth depth. We enhance adversarial robustness against real-world attacks by incorporating L_0-norm-bounded perturbation during training. We evaluate our method against supervised learning-based and contrastive learning-based approaches specifically designed for MDE. Our experiments with two representative MDE networks demonstrate improved robustness against various adversarial attacks, with minimal impact on benign performance.

6/11/2024

cs.CV

Enhanced Object Tracking by Self-Supervised Auxiliary Depth Estimation Learning

Zhenyu Wei, Yujie He, Zhanchuan Cai

RGB-D tracking significantly improves the accuracy of object tracking. However, its dependency on real depth inputs and the complexity involved in multi-modal fusion limit its applicability across various scenarios. The utilization of depth information in RGB-D tracking inspired us to propose a new method, named MDETrack, which trains a tracking network with an additional capability to understand the depth of scenes, through supervised or self-supervised auxiliary Monocular Depth Estimation learning. The outputs of MDETrack's unified feature extractor are fed to the side-by-side tracking head and auxiliary depth estimation head, respectively. The auxiliary module will be discarded in inference, thus keeping the same inference speed. We evaluated our models with various training strategies on multiple datasets, and the results show an improved tracking accuracy even without real depth. Through these findings we highlight the potential of depth estimation in enhancing object tracking performance.

5/24/2024

cs.CV cs.AI

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler

Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.

4/4/2024

cs.CV

DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation

Mengtan Zhang, Yi Feng, Qijun Chen, Rui Fan

There has been a recent surge of interest in learning to perceive depth from monocular videos in an unsupervised fashion. A key challenge in this field is achieving robust and accurate depth estimation in challenging scenarios, particularly in regions with weak textures or where dynamic objects are present. This study makes three major contributions by delving deeply into dense correspondence priors to provide existing frameworks with explicit geometric constraints. The first novelty is a contextual-geometric depth consistency loss, which employs depth maps triangulated from dense correspondences based on estimated ego-motion to guide the learning of depth perception from contextual information, since explicitly triangulated depth maps capture accurate relative distances among pixels. The second novelty arises from the observation that there exists an explicit, deducible relationship between optical flow divergence and depth gradient. A differential property correlation loss is, therefore, designed to refine depth estimation with a specific emphasis on local variations. The third novelty is a bidirectional stream co-adjustment strategy that enhances the interaction between rigid and optical flows, encouraging the former towards more accurate correspondence and making the latter more adaptable across various scenarios under the static scene hypotheses. DCPI-Depth, a framework that incorporates all these innovative components and couples two bidirectional and collaborative streams, achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming all existing prior arts. Specifically, it demonstrates accurate depth estimation in texture-less and dynamic regions, and shows more reasonable smoothness.

5/28/2024

cs.CV cs.RO