Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

Read original: arXiv:2404.15506 - Published 8/19/2024 by Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen

Overview

Metric3D v2 is a versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.
It predicts 3D scene geometry from a single image and can be applied to new scenes without additional training or fine-tuning.
The model demonstrates strong performance on various benchmarks, making it a valuable tool for computer vision tasks.

Plain English Explanation

Metric3D v2 is an AI model that analyzes a single image and estimates the 3D shape of the scene. This includes predicting the metric depth (the real-world distance from the camera to each visible point, in physical units such as meters) as well as the surface normals (the direction each surface faces in 3D space).
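To make these two quantities concrete, here is a minimal Python sketch (not taken from the paper) that lifts a pixel with a metric depth value to a 3D point using a pinhole camera model, then estimates a surface normal from neighbouring points. The function names, the intrinsics fx, fy, cx, cy, and the finite-difference normal are illustrative assumptions; Metric3D v2 predicts normals with a learned network rather than computing them this way.

```python
import math

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth (meters along the optical axis)
    to a 3D point in the camera frame, assuming a pinhole camera."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def normal_at(depth_map, u, v, fx, fy, cx, cy):
    """Estimate the unit surface normal at pixel (u, v) from its right and
    lower neighbours. depth_map is a list of rows; (u + 1, v) and (u, v + 1)
    must lie inside the map."""
    p = backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
    p_right = backproject(u + 1, v, depth_map[v][u + 1], fx, fy, cx, cy)
    p_down = backproject(u, v + 1, depth_map[v + 1][u], fx, fy, cx, cy)
    a = tuple(q - r for q, r in zip(p_right, p))  # tangent along image x
    b = tuple(q - r for q, r in zip(p_down, p))   # tangent along image y
    # The cross product of the two tangents is perpendicular to the surface.
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    # Flip so the normal points back toward the camera (negative z).
    if n[2] > 0:
        n = tuple(-c for c in n)
    return n

# A flat wall 2 m away that faces the camera.
wall = [[2.0, 2.0, 2.0] for _ in range(3)]
wall_normal = normal_at(wall, 1, 1, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
```

For the flat wall in this example the estimated normal has a z component of -1 and zero x and y components, meaning the surface faces straight back at the camera.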

What's unique about Metric3D v2 is that it can do all of this without requiring any additional training or fine-tuning on new datasets. It's a "foundation model" that can be applied to a wide variety of scenes and tasks, like benchmarking monocular geometry estimation models or enabling self-supervised 3D depth estimation.

By providing accurate 3D scene understanding from a single image, Metric3D v2 can be a valuable tool for applications like autonomous driving, augmented reality, and 3D reconstruction.

Technical Explanation

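The central contribution of Metric3D v2 is zero-shot monocular metric depth and surface normal estimation: the model predicts the 3D geometry of a scene without any additional training on that specific scene or dataset. According to the paper's abstract, two components make this possible: a canonical camera space transformation module, which resolves the metric ambiguity caused by differing camera models, and a joint depth-normal optimization module, which lets the normal estimator learn from metric depth data beyond the available normal labels.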
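The metric ambiguity between cameras can be illustrated with a short Python sketch. It shows why a single image cannot fix metric scale without camera information, and one way a canonical camera space transformation can work: rescaling depth by the ratio of focal lengths. The canonical focal length of 1000 pixels and all function names are assumptions for illustration, not the authors' implementation.

```python
CANONICAL_FOCAL_PX = 1000.0  # assumed value, chosen only for illustration

def projected_height_px(object_height_m, depth_m, focal_px):
    """Pinhole projection: image height in pixels of an object at a given depth."""
    return focal_px * object_height_m / depth_m

def to_canonical_depth(depth_m, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Rescale a metric depth label as if the image came from the canonical camera."""
    return depth_m * canonical_focal_px / focal_px

def from_canonical_depth(canonical_depth, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Undo the rescaling to recover metric depth for the real camera."""
    return canonical_depth * focal_px / canonical_focal_px

# A 1 m tall object looks identical in these two setups, although the true
# depths differ by a factor of two.
near_short_lens = projected_height_px(1.0, depth_m=5.0, focal_px=500.0)
far_long_lens = projected_height_px(1.0, depth_m=10.0, focal_px=1000.0)

# After the transformation both cases share one canonical depth, so identical
# image content is paired with identical training targets.
canon_a = to_canonical_depth(5.0, focal_px=500.0)
canon_b = to_canonical_depth(10.0, focal_px=1000.0)

# At inference time, the known focal length converts the prediction back.
recovered = from_canonical_depth(canon_a, focal_px=500.0)
```

Under these assumptions the network only has to learn depth for one virtual camera, and the focal length of the real camera supplies the remaining scale factor.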

The architecture of Metric3D v2 builds on previous work in geometric deep learning, using a combination of convolutional neural networks and transformer-based modules to extract rich visual features. These features are then processed through a series of task-specific heads to generate the final depth and normal predictions.

A critical aspect of the model is its use of self-supervised pretraining on large-scale datasets. The abstract puts the training set at over 16 million images from thousands of camera models, with differing annotation types. This scale allows Metric3D v2 to learn general visual representations that transfer to new scenes and unseen camera settings without expensive per-task fine-tuning.

The researchers demonstrate the versatility of Metric3D v2 through extensive evaluations on standard benchmarks for depth and normal estimation. The model achieves state-of-the-art performance, showcasing its ability to generalize well to diverse visual inputs.
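This summary does not name the evaluation metrics. As a hedged illustration, the Python sketch below implements three measures that are commonly reported on depth and normal benchmarks: absolute relative error, the fraction of pixels within a 1.25 ratio of ground truth, and mean angular error between unit normals. The function names and sample values are hypothetical and are not results from the paper.

```python
import math

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)

def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred / gt, gt / pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)

def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between pairs of unit-length normal vectors."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
        total += math.degrees(math.acos(dot))
    return total / len(pred_normals)

# Hypothetical depths in meters for three pixels.
pred_depth = [1.0, 2.2, 4.0]
gt_depth = [1.0, 2.0, 2.0]
depth_error = abs_rel(pred_depth, gt_depth)
depth_accuracy = delta1(pred_depth, gt_depth)
```

Lower is better for the two error measures and higher is better for the threshold accuracy. Because these metrics compare predictions directly with ground truth in physical units, they reward a model only if its scale is correct, which is what separates metric depth from relative depth.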

Critical Analysis

One potential limitation of Metric3D v2 is that it may not be as accurate as specialized models trained on specific datasets or tasks. While the zero-shot capabilities are impressive, there could be a trade-off in terms of peak performance compared to more targeted approaches.

Additionally, the model's reliance on self-supervised pretraining means that the quality of the underlying datasets and pretraining procedures can have a significant impact on the final performance. Careful curation and selection of the pretraining data could be an important area for further research.

It would also be interesting to see how Metric3D v2 performs in real-world applications, where factors like occlusions, lighting variations, and sensor noise could pose additional challenges. Further testing and deployment in practical scenarios could provide valuable insights into the model's strengths and weaknesses.

Conclusion

Metric3D v2 represents an exciting advancement in monocular 3D scene understanding. By providing accurate and versatile depth and surface normal estimation from a single image, the model has the potential to enable a wide range of computer vision applications, from autonomous navigation to interactive augmented reality experiences.

The zero-shot capabilities of Metric3D v2 make it a valuable "foundation model" that can be readily applied to new tasks and datasets without the need for extensive retraining or fine-tuning. As AI models continue to become more powerful and flexible, tools like Metric3D v2 will play an increasingly important role in unlocking the full potential of 3D computer vision.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen

We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. SoTA monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different-type annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2.

8/19/2024

TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs

Horatiu Florea, Sergiu Nedevschi

Aerial scene understanding systems face stringent payload restrictions and must often rely on monocular depth estimation for modelling scene geometry, which is an inherently ill-posed problem. Moreover, obtaining accurate ground truth data required by learning-based methods raises significant additional challenges in the aerial domain. Self-supervised approaches can bypass this problem, at the cost of providing only up-to-scale results. Similarly, recent supervised solutions which make good progress towards zero-shot generalization also provide only relative depth values. This work presents TanDepth, a practical, online scale recovery method for obtaining metric depth results from relative estimations at inference-time, irrespective of the type of model generating them. Tailored for Unmanned Aerial Vehicle (UAV) applications, our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view using extrinsic and intrinsic information. An adaptation to the Cloth Simulation Filter is presented, which allows selecting ground points from the estimated depth map to then correlate with the projected reference points. We evaluate and compare our method against alternate scaling methods adapted for UAVs, on a variety of real-world scenes. Considering the limited availability of data for this domain, we construct and release a comprehensive, depth-focused extension to the popular UAVid dataset to further research.

9/10/2024

Incorporating dense metric depth into neural 3D representations for view synthesis and relighting

Arkadeep Narayan Chaudhury, Igor Vasiljevic, Sergey Zakharov, Vitor Guizilini, Rares Ambrus, Srinivasa Narasimhan, Christopher G. Atkeson

Synthesizing accurate geometry and photo-realistic appearance of small scenes is an active area of research with compelling use cases in gaming, virtual reality, robotic-manipulation, autonomous driving, convenient product capture, and consumer-level photography. When applying scene geometry and appearance estimation techniques to robotics, we found that the narrow cone of possible viewpoints due to the limited range of robot motion and scene clutter caused current estimation techniques to produce poor quality estimates or even fail. On the other hand, in robotic applications, dense metric depth can often be measured directly using stereo and illumination can be controlled. Depth can provide a good initial estimate of the object geometry to improve reconstruction, while multi-illumination images can facilitate relighting. In this work we demonstrate a method to incorporate dense metric depth into the training of neural 3D representations and address an artifact observed while jointly refining geometry and appearance by disambiguating between texture and geometry edges. We also discuss a multi-flash stereo camera system developed to capture the necessary data for our pipeline and show results on relighting and view synthesis with a few training views.

9/6/2024

GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion

Vitor Guizilini, Pavel Tokmakov, Achal Dave, Rares Ambrus

3D reconstruction from a single image is a long-standing problem in computer vision. Learning-based methods address its inherent scale ambiguity by leveraging increasingly large labeled and unlabeled datasets, to produce geometric priors capable of generating accurate predictions across domains. As a result, state of the art approaches show impressive performance in zero-shot relative and metric depth estimation. Recently, diffusion models have exhibited remarkable scalability and generalizable properties in their learned representations. However, because these models repurpose tools originally designed for image generation, they can only operate on dense ground-truth, which is not available for most depth labels, especially in real-world settings. In this paper we present GRIN, an efficient diffusion model designed to ingest sparse unstructured training data. We use image features with 3D geometric positional encodings to condition the diffusion process both globally and locally, generating depth predictions at a pixel-level. With comprehensive experiments across eight indoor and outdoor datasets, we show that GRIN establishes a new state of the art in zero-shot metric monocular depth estimation even when trained from scratch.

9/17/2024