SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model

Read original: arXiv:2403.08556 - Published 8/16/2024 by Yihao Liu, Feng Xue, Anlong Ming, Mingshuai Zhao, Huadong Ma, Nicu Sebe

Overview

The paper presents a novel deep learning model called SM⁴Depth that can accurately estimate metric depth from a single image across multiple cameras and scenes using a single model.
This is a notable step beyond earlier depth estimation methods, which were often camera- or scene-specific.
According to the paper's abstract, the model is trained on a consumer-grade GPU with only 150K RGB-D pairs and still performs strongly on most previously unseen datasets.

Plain English Explanation

The researchers have developed a new deep learning-based system called SM⁴Depth that can accurately estimate the depth of objects in an image using only a single camera. This is an important task in computer vision, as knowing the distance of objects from the camera enables many applications like 3D object tracking and augmented reality.

Previous depth estimation methods often required multiple cameras or known camera calibration parameters, or were tied to a particular camera or scene. In contrast, SM⁴Depth works from a single camera and uses one unified model to estimate depth across a wide variety of cameras and scenes.

This ability to produce metric depth across cameras and scenes matters in practice, because the model can be deployed more flexibly in real-world applications without specialized calibration or data collection for each new camera or setting.
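One reason metric depth from a single image depends on the camera can be seen from the standard pinhole projection model. The sketch below is a generic illustration, not code from the paper; the function name and the numbers are made up for the example.

```python
def projected_height_px(focal_px: float, object_height_m: float, depth_m: float) -> float:
    """Pinhole projection: height in pixels of an object seen at a given depth."""
    return focal_px * object_height_m / depth_m


# A 2 m tall object at 5 m, seen with a 500 px focal length ...
near = projected_height_px(500.0, 2.0, 5.0)
# ... covers the same number of pixels as the same object at 10 m
# seen with a 1000 px focal length.
far = projected_height_px(1000.0, 2.0, 10.0)

print(near, far)  # 200.0 200.0
```

Because both setups produce the same image height, a model that ignores the camera cannot tell 5 m from 10 m here, which is why handling multiple cameras with one model is hard.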

Technical Explanation

According to the paper's abstract, SM⁴Depth rests on two main ideas. The first is a new way of modeling metric scale, called variation-based unnormalized depth bins. It reduces the ambiguity of conventional metric bins and lets the model adapt better to the large depth gaps between scenes during training.

The second is a divide-and-conquer strategy that reduces the reliance on massive training data. Instead of estimating the metric bins directly from one vast solution space, the model estimates them from multiple solution sub-spaces, which lowers the complexity of the problem.
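The paper's abstract describes SM⁴Depth's output in terms of metric depth bins. As background, here is a minimal sketch of how bin-based depth prediction works in the widely used AdaBins-style formulation: normalized bin widths are scaled to a depth range, and each pixel's depth is the probability-weighted average of the bin centers. This is not the paper's variation-based unnormalized bins; the function names, the 0 to 10 m range, and the numbers are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(norm_widths, d_min, d_max):
    """Scale normalized bin widths (summing to 1) to [d_min, d_max]
    and return the center of each bin in meters."""
    centers, left = [], d_min
    for w in norm_widths:
        width = w * (d_max - d_min)
        centers.append(left + width / 2.0)
        left += width
    return centers


def depth_from_bins(pixel_logits, centers):
    """Depth of one pixel: expectation of bin centers under softmax probabilities."""
    probs = softmax(pixel_logits)
    return sum(p * c for p, c in zip(probs, centers))


centers = bin_centers([0.25, 0.25, 0.5], 0.0, 10.0)
print(centers)  # [1.25, 3.75, 7.5]

# A pixel whose logits strongly favor the middle bin lands near 3.75 m.
print(depth_from_bins([0.0, 50.0, 0.0], centers))
```

In this conventional setup the depth range is fixed ahead of time, which is awkward when one model must cover both small rooms and long outdoor views.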

The researchers also introduce several architectural modifications, such as using a transformer-based encoder and applying specialized normalization layers, which further enhance the model's ability to capture the complex, multi-scale spatial relationships needed for accurate depth estimation.
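The paper's abstract mentions estimating metric bins from multiple solution sub-spaces rather than from one vast space. The following is only a hypothetical coarse-to-fine illustration of that general idea: first pick a depth sub-range, then place bins only inside it. The sub-ranges, the scores, and all function names are invented for the example and are not taken from the paper or its code.

```python
def pick_subspace(range_scores):
    """Index of the highest-scoring depth sub-range
    (scores stand in for a hypothetical classifier output)."""
    return max(range(len(range_scores)), key=lambda i: range_scores[i])


def bins_in_subspace(sub_ranges, range_scores, norm_widths):
    """Place bins only inside the selected sub-range and return their centers."""
    lo, hi = sub_ranges[pick_subspace(range_scores)]
    centers, left = [], lo
    for w in norm_widths:
        width = w * (hi - lo)
        centers.append(left + width / 2.0)
        left += width
    return centers


# Hypothetical sub-ranges in meters: a short "indoor" one and a long "outdoor" one.
sub_ranges = [(0.0, 10.0), (0.0, 80.0)]

# The made-up scores favor the second sub-range, so bins span 0 to 80 m.
print(bins_in_subspace(sub_ranges, [0.2, 0.8], [0.5, 0.5]))  # [20.0, 60.0]
```

Restricting the bins to one sub-range at a time means each prediction has to cover a much narrower span of possible depths.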

The authors report that SM⁴Depth, trained on a consumer-grade GPU with just 150K RGB-D pairs, achieves outstanding performance on most never-before-seen datasets and keeps its accuracy consistent across indoor and outdoor scenes. They also introduce an uncut depth dataset, BUPT Depth, to evaluate depth accuracy and consistency across varied indoor and outdoor scenes.
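Depth estimation results of this kind are commonly scored with a few standard metrics: absolute relative error (AbsRel), root mean squared error (RMSE), and the share of pixels whose prediction is within a factor of 1.25 of the ground truth. The sketch below shows how these are typically computed; it is not the paper's evaluation code, and the function name and sample values are assumptions for illustration. Predictions are assumed to be positive.

```python
import math


def depth_metrics(pred, gt):
    """Return (AbsRel, RMSE, delta<1.25 accuracy) over pixels with valid ground truth.

    pred and gt are flat lists of depths in meters; gt <= 0 marks a missing pixel.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return abs_rel, rmse, delta1


# Four pixels; the third has no ground truth and is skipped.
pred = [1.0, 2.0, 4.0, 10.0]
gt = [1.0, 2.4, 0.0, 5.0]
abs_rel, rmse, delta1 = depth_metrics(pred, gt)
print(abs_rel, rmse, delta1)
```

Lower is better for AbsRel and RMSE, while higher is better for the delta accuracy. In the sample above, two of the three valid pixels fall within the 1.25 threshold.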

Critical Analysis

One limitation of the paper is that it does not provide a detailed analysis of the model's failure cases or discuss potential biases that may arise when deploying the system in the real world. The experiments are primarily conducted on standard academic benchmarks, which may not fully capture the diversity of real-world scenarios.

Additionally, while the model's ability to generalize across cameras and scenes is impressive, the paper does not explore how the performance might degrade as the domain shift increases (e.g., estimating depth for a completely novel camera type or scene context).

Further research could investigate the model's robustness to factors like varying lighting conditions, object occlusions, and scene clutter, which are common challenges in real-world depth estimation tasks.

Conclusion

The SM⁴Depth model represents a significant advancement in the field of monocular metric depth estimation. By learning a unified depth representation that can generalize across multiple cameras and scenes, the researchers have developed a highly flexible system that could enable a wide range of computer vision applications, from robotics to augmented reality.

While the paper demonstrates impressive results on standard benchmarks, further investigation into the model's real-world performance and failure modes would help establish its practical viability and guide future research directions in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model

Yihao Liu, Feng Xue, Anlong Ming, Mingshuai Zhao, Huadong Ma, Nicu Sebe

In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as the foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches face challenges in maintaining consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of data for training, leading to significant time and hardware expenses. This paper presents SM⁴Depth, a model that seamlessly works for both indoor and outdoor scenes, without needing extensive training data and GPU clusters. Firstly, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation-based unnormalized depth bins. It reduces the ambiguity of the conventional metric bins and enables better adaptation to large depth gaps of scenes during training. Secondly, we propose a divide-and-conquer solution to reduce reliance on massive training data. Instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, BUPT Depth, to evaluate the depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM⁴Depth achieves outstanding performance on most never-before-seen datasets, especially maintaining consistent accuracy across indoors and outdoors. The code can be found at https://github.com/mRobotit/SM4Depth.

8/16/2024

ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation

Ruijie Zhu, Chuxin Wang, Ziyang Song, Li Liu, Tianzhu Zhang, Yongdong Zhang

Estimating depth from a single image is a challenging visual task. Compared to relative depth estimation, metric depth estimation attracts more attention due to its practical physical significance and critical applications in real-life scenarios. However, existing metric depth estimation methods are typically trained on specific datasets with similar scenes, facing challenges in generalizing across scenes with significant scale variations. To address this challenge, we propose a novel monocular depth estimation method called ScaleDepth. Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction (SASP) module and an adaptive relative depth estimation (ARDE) module, respectively. The proposed ScaleDepth enjoys several merits. First, the SASP module can implicitly combine structural and semantic features of the images to predict precise scene scales. Second, the ARDE module can adaptively estimate the relative depth distribution of each image within a normalized depth space. Third, our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework, without the need for setting the depth range or fine-tuning model. Extensive experiments demonstrate that our method attains state-of-the-art performance across indoor, outdoor, unconstrained, and unseen scenes. Project page: https://ruijiezhu94.github.io/ScaleDepth

7/12/2024

M${^2}$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

Yingshuang Zou, Yikang Ding, Xi Qiu, Haoqian Wang, Haotian Zhang

This paper presents a novel self-supervised two-frame multi-camera metric depth estimation network, termed M${^2}$Depth, which is designed to predict reliable scale-aware surrounding depth in autonomous driving. Unlike the previous works that use multi-view images from a single time-step or multiple time-step images from a single camera, M${^2}$Depth takes temporally adjacent two-frame images from multiple cameras as inputs and produces high-quality surrounding depth. We first construct cost volumes in spatial and temporal domains individually and propose a spatial-temporal fusion module that integrates the spatial-temporal information to yield a strong volume presentation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and strengthen the depth edges. Extensive experimental results on nuScenes and DDAD benchmarks show M${^2}$Depth achieves state-of-the-art performance. More results can be found in https://heiheishuang.xyz/M2Depth .

5/6/2024

TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs

Horatiu Florea, Sergiu Nedevschi

Aerial scene understanding systems face stringent payload restrictions and must often rely on monocular depth estimation for modelling scene geometry, which is an inherently ill-posed problem. Moreover, obtaining accurate ground truth data required by learning-based methods raises significant additional challenges in the aerial domain. Self-supervised approaches can bypass this problem, at the cost of providing only up-to-scale results. Similarly, recent supervised solutions which make good progress towards zero-shot generalization also provide only relative depth values. This work presents TanDepth, a practical, online scale recovery method for obtaining metric depth results from relative estimations at inference-time, irrespective of the type of model generating them. Tailored for Unmanned Aerial Vehicle (UAV) applications, our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view using extrinsic and intrinsic information. An adaptation to the Cloth Simulation Filter is presented, which allows selecting ground points from the estimated depth map to then correlate with the projected reference points. We evaluate and compare our method against alternate scaling methods adapted for UAVs, on a variety of real-world scenes. Considering the limited availability of data for this domain, we construct and release a comprehensive, depth-focused extension to the popular UAVid dataset to further research.

9/10/2024