Cross-Domain Synthetic-to-Real In-the-Wild Depth and Normal Estimation for 3D Scene Understanding

2212.05040

Published 6/10/2024 by Jay Bhanushali, Manivannan Muniyandi, Praneeth Chakravarthula

Abstract

We present a cross-domain inference technique that learns from synthetic data to estimate depth and normals for in-the-wild omnidirectional 3D scenes encountered in real-world uncontrolled settings. To this end, we introduce UBotNet, an architecture that combines UNet and Bottleneck Transformer elements to predict consistent scene normals and depth. We also introduce the OmniHorizon synthetic dataset containing 24,335 omnidirectional images that represent a wide variety of outdoor environments, including buildings, streets, and diverse vegetation. This dataset is generated from expansive, lifelike virtual spaces and encompasses dynamic scene elements, such as changing lighting conditions, different times of day, pedestrians, and vehicles. Our experiments show that UBotNet achieves significantly improved accuracy in depth estimation and normal estimation compared to existing models. Lastly, we validate cross-domain synthetic-to-real depth and normal estimation on real outdoor images using UBotNet trained solely on our synthetic OmniHorizon dataset, demonstrating the potential of both the synthetic dataset and the proposed network for real-world scene understanding applications.

Overview

This paper presents a cross-domain inference technique that learns from synthetic data to estimate depth and normals for omnidirectional 3D scenes encountered in real-world uncontrolled settings.
The authors introduce UBotNet, an architecture that combines UNet and Bottleneck Transformer elements to predict consistent scene normals and depth.
They also introduce the OmniHorizon synthetic dataset containing over 24,000 omnidirectional images representing a wide variety of outdoor environments.
Experiments show that UBotNet achieves significantly improved accuracy in depth and normal estimation compared to existing models, and the authors validate cross-domain synthetic-to-real depth and normal estimation on real outdoor images.

Plain English Explanation

The researchers have developed a new technique that can accurately estimate the depth and surface normals (the direction the surface is facing) of 3D scenes captured in 360-degree omnidirectional images. This is a challenging task, as these real-world scenes can have a lot of variation, such as different buildings, streets, and vegetation.

To address this, the researchers created a large synthetic dataset called OmniHorizon, containing over 24,000 realistic 360-degree outdoor scenes generated from virtual environments. They then trained a new neural network architecture called UBotNet, which combines two powerful machine learning models (UNet and Bottleneck Transformer) to predict the depth and surface normals of these scenes.

The key insight is that by training on this diverse synthetic dataset, the UBotNet model can learn to accurately estimate depth and normals even for real-world outdoor scenes it has never seen before. The researchers found that UBotNet significantly outperforms existing models on these tasks, demonstrating the potential of using synthetic data to improve real-world 3D scene understanding.

Technical Explanation

The paper presents a cross-domain inference technique that learns from synthetic data to estimate depth and normals for in-the-wild omnidirectional 3D scenes. The authors introduce UBotNet, an architecture that combines UNet and Bottleneck Transformer elements to predict consistent scene normals and depth.
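
As a rough illustration of the Bottleneck Transformer idea (replacing a spatial convolution at the low-resolution bottleneck with self-attention over all positions of the feature map), the sketch below applies single-head self-attention with a residual connection to a small feature grid. It is not the UBotNet implementation: the function name `self_attention_bottleneck`, the identity query/key/value projections, and the absence of positional encodings and multiple heads are all simplifying assumptions made here for readability.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_bottleneck(feature_map):
    """Single-head self-attention over an H x W grid of C-dim feature vectors.

    Assumption: queries, keys, and values are the features themselves
    (identity projections); a trained model would use learned projections.
    """
    h, w = len(feature_map), len(feature_map[0])
    # Flatten the spatial grid into a sequence of H*W tokens.
    tokens = [feature_map[i][j] for i in range(h) for j in range(w)]
    c = len(tokens[0])
    scale = 1.0 / math.sqrt(c)

    out = []
    for q in tokens:
        # Scaled dot-product similarity between this token and every token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of all tokens: every position can see the whole scene.
        attended = [sum(wt * v[d] for wt, v in zip(weights, tokens))
                    for d in range(c)]
        # Residual connection, as in a typical bottleneck block.
        out.append([a + b for a, b in zip(q, attended)])

    # Restore the H x W layout so a decoder could consume the result.
    return [out[i * w:(i + 1) * w] for i in range(h)]


if __name__ == "__main__":
    grid = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.5, 0.5]]]
    result = self_attention_bottleneck(grid)
    print(len(result), len(result[0]), len(result[0][0]))
```

In a UNet-style network, a block like this would sit between the encoder and decoder, where the feature map is small enough for attention over all positions to be affordable.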

The researchers also introduce the OmniHorizon synthetic dataset, containing 24,335 omnidirectional images representing a wide variety of outdoor environments, including buildings, streets, and diverse vegetation. This dataset is generated from expansive, lifelike virtual spaces and encompasses dynamic scene elements, such as changing lighting conditions, different times of day, pedestrians, and vehicles.
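
To make the relationship between depth and surface normals in an omnidirectional image concrete, the following sketch back-projects equirectangular pixels to 3D points and derives a normal from neighboring points. The equirectangular layout, the axis convention (y up, z forward), and the helper names `pixel_to_ray`, `backproject`, and `normal_from_points` are assumptions chosen for illustration; the source does not describe how OmniHorizon stores or derives its ground truth.

```python
import math


def pixel_to_ray(u, v, width, height):
    """Unit viewing direction for pixel (u, v) of an equirectangular image.

    Assumed convention: longitude runs from -pi to pi left to right,
    latitude from +pi/2 (top) to -pi/2 (bottom); y is up, z is forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))


def backproject(u, v, depth, width, height):
    """3D point for a pixel, treating depth as distance along the ray."""
    x, y, z = pixel_to_ray(u, v, width, height)
    return (depth * x, depth * y, depth * z)


def normal_from_points(p, p_right, p_down):
    """Unit normal from a point and its right and lower neighbors."""
    a = tuple(r - c for r, c in zip(p_right, p))
    b = tuple(d - c for d, c in zip(p_down, p))
    # The cross product of two surface tangents is perpendicular to the surface.
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    # Flip the normal so it faces the camera at the origin.
    if sum(nc * pc for nc, pc in zip(n, p)) > 0:
        n = tuple(-c for c in n)
    return n


if __name__ == "__main__":
    w, h = 64, 32

    def wall_depth(u, v):
        # Distance along the ray to a flat wall at z = 2 in front of the camera.
        return 2.0 / pixel_to_ray(u, v, w, h)[2]

    p = backproject(32, 16, wall_depth(32, 16), w, h)
    pr = backproject(33, 16, wall_depth(33, 16), w, h)
    pd = backproject(32, 17, wall_depth(32, 17), w, h)
    print(normal_from_points(p, pr, pd))
```

For the flat wall in the usage example, the recovered normal points back toward the camera along the negative z axis, which is the kind of consistency between depth and normals that a joint predictor is expected to respect.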

Experiments show that UBotNet achieves significantly improved accuracy in depth estimation and normal estimation compared to existing models, such as PanoNormal for indoor 360-degree normal estimation. The authors also validate cross-domain synthetic-to-real depth and normal estimation on real outdoor images using UBotNet trained solely on the synthetic OmniHorizon dataset, demonstrating the potential of both the synthetic dataset and the proposed network for real-world scene understanding applications.
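
This summary does not state which metrics the paper reports. As a hedged illustration, the sketch below computes measures that are commonly used in the depth and normal estimation literature: absolute relative error, RMSE, the fraction of pixels within a 1.25 ratio threshold, and mean angular error between normals. The function names `depth_metrics` and `mean_angular_error_deg` are hypothetical and are not taken from the paper.

```python
import math


def depth_metrics(pred, gt):
    """Common depth metrics over flat lists of predicted and true depths.

    Pixels with non-positive ground truth are treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta < 1.25: share of pixels whose prediction is within 25% in ratio.
    delta1 = sum(1 for p, g in pairs
                 if p > 0 and max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and true normal vectors."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = (math.sqrt(sum(a * a for a in p))
                * math.sqrt(sum(b * b for b in g)))
        # Clamp to guard against rounding slightly outside [-1, 1].
        cos_angle = max(-1.0, min(1.0, dot / norm))
        total += math.degrees(math.acos(cos_angle))
    return total / len(pred_normals)


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
    print(mean_angular_error_deg([(1, 0, 0), (0, 0, 1)],
                                 [(0, 1, 0), (0, 0, 1)]))
```

Lower values are better for the error metrics, while a higher threshold accuracy is better.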

Critical Analysis

The paper presents a compelling approach to leveraging synthetic data to improve real-world 3D scene understanding. The OmniHorizon dataset and UBotNet architecture appear to be well-designed and effective, as evidenced by the significant performance improvements over existing models.

However, the paper does not address certain limitations or potential issues with the research. For example, although the dataset is described as generated from lifelike virtual spaces, it is unclear how representative those spaces are of actual outdoor scenes. There may be biases or simplifications in the virtual environments that could limit the model's generalization to more complex real-world scenarios.

Additionally, the paper does not discuss the computational cost or inference speed of the UBotNet model, which could be important considerations for real-world applications. It would also be interesting to see how the model performs on a wider range of real-world datasets, beyond the validation set used in the paper.

Overall, this research represents an important step forward in the use of synthetic data for improving 3D scene understanding, but further work may be needed to fully understand the limitations and potential of this approach.

Conclusion

This paper presents a novel cross-domain inference technique that leverages synthetic data to accurately estimate depth and surface normals for omnidirectional 3D scenes in the real world. The researchers introduce the UBotNet architecture and the large-scale OmniHorizon synthetic dataset, which together demonstrate significant improvements in depth and normal estimation accuracy compared to existing models.

The ability to accurately predict 3D scene properties from 360-degree images has important implications for various real-world applications, such as autonomous navigation, augmented reality, and urban planning. By using synthetic data to train models like UBotNet, researchers can potentially overcome the challenges of collecting and annotating large-scale real-world datasets for these tasks.

While the paper highlights the potential of this approach, further research is needed to fully address its limitations and ensure the robust performance of these models in diverse real-world settings. Nonetheless, this work represents an important contribution to the field of 3D scene understanding and the use of synthetic data to enable more powerful AI models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Domain-Transferred Synthetic Data Generation for Improving Monocular Depth Estimation

Seungyeop Lee, Knut Peterson, Solmaz Arezoomandan, Bill Cai, Peihan Li, Lifeng Zhou, David Han

A major obstacle to the development of effective monocular depth estimation algorithms is the difficulty in obtaining high-quality depth data that corresponds to collected RGB images. Collecting this data is time-consuming and costly, and even data collected by modern sensors has limited range or resolution, and is subject to inconsistencies and noise. To combat this, we propose a method of data generation in simulation using 3D synthetic environments and CycleGAN domain transfer. We compare this method of data generation to the popular NYUDepth V2 dataset by training a depth estimation model based on the DenseDepth structure using different training sets of real and simulated data. We evaluate the performance of the models on newly collected images and LiDAR depth data from a Husky robot to verify the generalizability of the approach and show that GAN-transformed data can serve as an effective alternative to real-world data, particularly in depth estimation.

5/3/2024

cs.CV cs.AI eess.IV

360 in the Wild: Dataset for Depth Prediction and View Synthesis

Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon

The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large scale 360° videos dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

6/28/2024

cs.CV cs.AI

DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception

Yunze Man, Liang-Yan Gui, Yu-Xiong Wang

Closing the domain gap between training and deployment and incorporating multiple sensor modalities are two challenging yet critical topics for self-driving. Existing work focuses on only one of the above topics, overlooking the simultaneous domain and modality shift which pervasively exists in real-world scenarios. A model trained with multi-sensor data collected in Europe may need to run in Asia with a subset of input sensors available. In this work, we propose DualCross, a cross-modality cross-domain adaptation framework to facilitate the learning of a more robust monocular bird's-eye-view (BEV) perception model, which transfers the point cloud knowledge from a LiDAR sensor in one domain during the training phase to the camera-only testing scenario in a different domain. This work results in the first open analysis of cross-domain cross-sensor perception and adaptation for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under a wide range of domain shifts and show state-of-the-art results against various baselines.

6/13/2024

cs.CV cs.AI cs.RO

Syn-to-Real Unsupervised Domain Adaptation for Indoor 3D Object Detection

Yunsong Wang, Na Zhao, Gim Hee Lee

The use of synthetic data in indoor 3D object detection offers the potential of greatly reducing the manual labor involved in 3D annotations and training effective zero-shot detectors. However, the complicated domain shifts across syn-to-real indoor datasets remain underexplored. In this paper, we propose a novel Object-wise Hierarchical Domain Alignment (OHDA) framework for syn-to-real unsupervised domain adaptation in indoor 3D object detection. Our approach includes an object-aware augmentation strategy to effectively diversify the source domain data, and we introduce a two-branch adaptation framework consisting of an adversarial training branch and a pseudo labeling branch, in order to simultaneously reach holistic-level and class-level domain alignment. The pseudo labeling is further refined through two proposed schemes specifically designed for indoor UDA. Our adaptation results from synthetic dataset 3D-FRONT to real-world datasets ScanNetV2 and SUN RGB-D demonstrate remarkable mAP25 improvements of 9.7% and 9.1% over Source-Only baselines, respectively, and consistently outperform the methods adapted from 2D and 3D outdoor scenarios. The code will be publicly available upon paper acceptance.

6/18/2024

cs.CV