GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision

2405.10591

Published 5/20/2024 by Xin Tan, Wenbin Wu, Zhiwei Zhang, Chaojie Fan, Yong Peng, Zhizhong Zhang, Yuan Xie, Lizhuang Ma

Abstract

3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the lack of generalizability due to sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) Integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation; 2) Utilization of a mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) Adoption of context-aware self-training loss functions in the pre-training stage to complement LiDAR supervision, involving the re-rendering of 2D depth maps from 3D occupancy features and leveraging image reconstruction loss to obtain denser depth supervision besides sparse LiDAR ground truths. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution and the most lightweight image backbone among current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experimentation also demonstrates the consistent superiority of our method over baselines and alternative approaches.

Overview

Proposes a new vision-only 3D occupancy prediction network called GEOcc
Fuses explicit (lift-based) and implicit (transformer-based) depth modeling and adds context-aware self-supervision
Aims to improve 3D scene understanding for applications like autonomous driving

Plain English Explanation

The paper introduces a new 3D occupancy prediction model called GEOcc that combines implicit and explicit depth information. Occupancy prediction is an important task for autonomous driving systems, as it allows them to understand the 3D structure of the environment and plan their movements accordingly.

GEOcc takes in surround-view 2D images and predicts a 3D occupancy grid, which indicates which parts of the 3D space are occupied and which are empty. To do this, it fuses two ways of modeling depth, both computed from camera images alone. In explicit depth modeling, the network predicts depth for the image pixels and uses it to lift image features into 3D. In implicit depth modeling, projection-based transformers learn the 2D-to-3D mapping without predicting depth directly. According to the abstract, combining the two makes the view transformation denser and more robust.

Additionally, GEOcc uses context-aware self-training losses during pre-training. The model re-renders 2D depth maps from its 3D occupancy features and applies an image reconstruction loss, which gives it denser depth supervision than sparse LiDAR ground truth provides on its own. This self-supervision is intended to help the model learn more robust and generalizable 3D representations despite the sparsity of the LiDAR labels.
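As a rough illustration of what re-rendering a depth map from 3D occupancy can look like, the sketch below uses standard volume-rendering (alpha compositing) weights along each camera ray. It is not taken from the GEOcc code: the function names `render_depth` and `masked_l1`, the array shapes, and the choice of an L1 loss against sparse LiDAR depth are all assumptions made for illustration.

```python
import numpy as np

def render_depth(alpha, t):
    """Render expected depth along rays from per-sample occupancy.

    alpha: (R, S) occupancy probability at each of S samples on R rays.
    t:     (S,)   depth of each sample along the ray.
    Returns (depth, weights) with shapes (R,) and (R, S).
    """
    # Transmittance: probability the ray reaches sample i unblocked,
    # i.e. the product of (1 - alpha) over all earlier samples.
    trans = np.cumprod(1.0 - alpha, axis=1)
    trans = np.concatenate([np.ones((alpha.shape[0], 1)), trans[:, :-1]], axis=1)
    weights = alpha * trans
    depth = (weights * t).sum(axis=1)
    return depth, weights

def masked_l1(pred, target, valid):
    """Mean absolute error over pixels that have a (sparse) LiDAR depth."""
    return float(np.abs(pred - target)[valid].mean())

# Two toy rays with samples at depths 1, 2, ..., 10.
t = np.linspace(1.0, 10.0, 10)
alpha = np.zeros((2, 10))
alpha[0, 3] = 1.0  # ray 0 hits a fully occupied sample at depth 4
alpha[1, 6] = 1.0  # ray 1 hits a fully occupied sample at depth 7
depth, weights = render_depth(alpha, t)

# Only ray 0 has a LiDAR return in this toy example (hypothetical values).
lidar_depth = np.array([4.5, 0.0])
valid = np.array([True, False])
loss = masked_l1(depth, lidar_depth, valid)
```

Because the rendered depth exists for every pixel, it can also feed an image reconstruction loss between views, which is how the abstract describes obtaining supervision denser than the LiDAR points.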

Technical Explanation

The key components of GEOcc are:

Implicit-Explicit Depth Fusion: The model combines explicit lift-based depth prediction with implicit projection-based transformers in the 2D-to-3D view transformation. Both branches operate on camera images only, and the abstract credits their combination with a denser and more robust view transformation.
Contextual Self-Supervision: In the pre-training stage, context-aware self-training losses complement LiDAR supervision. The model re-renders 2D depth maps from its 3D occupancy features and uses an image reconstruction loss, obtaining denser depth supervision than the sparse LiDAR ground truth alone.
Geometric Enhancements: The depth fusion and rendering-based supervision above are aimed at improving the geometry of the predicted 3D occupancy grids. The abstract additionally lists a mask-based encoder-decoder architecture for fine-grained semantic predictions.
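To make the explicit, lift-based side of the depth modeling more concrete, here is a minimal sketch of the generic "lift" step used by Lift-Splat-style view transformers: each pixel's feature vector is spread over a set of depth bins, weighted by a predicted depth distribution. This is an assumption about the general technique, not GEOcc's implementation; the names `lift_features` and `softmax`, the shapes, and the random inputs are hypothetical, and the fusion with the implicit transformer branch is not shown.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_features(feat, depth_logits):
    """Lift 2D image features into a camera frustum.

    feat:         (C, H, W) image features.
    depth_logits: (D, H, W) unnormalized scores over D depth bins per pixel.
    Returns:      (D, C, H, W) frustum features, i.e. the outer product of
                  each pixel's depth distribution and its feature vector.
    """
    depth_prob = softmax(depth_logits, axis=0)  # per-pixel distribution over bins
    return depth_prob[:, None, :, :] * feat[None, :, :, :]

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 6))           # C=8 channels on a 4x6 feature map
depth_logits = rng.normal(size=(16, 4, 6))  # D=16 depth bins
frustum = lift_features(feat, depth_logits)
```

In a full pipeline the frustum features would then be scattered into the voxel grid using the camera intrinsics and extrinsics. Since each pixel's depth weights sum to one, summing the frustum over the depth axis recovers the original features.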

The authors evaluate GEOcc on the Occ3D-nuScenes dataset and report state-of-the-art performance, with a 3.3% improvement attributed to the proposed contributions, while using the lowest image resolution and the most lightweight image backbone among the compared models.
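Occupancy benchmarks are commonly scored with mean intersection-over-union (mIoU) over semantic classes, the metric quoted in the related-paper abstracts listed further down this page. The sketch below shows how that metric is typically computed on voxel labels; the function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions, and real benchmarks may additionally restrict scoring to visible voxels.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes for integer label arrays of equal shape."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        g = gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both arrays: skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Six toy voxels with three classes.
gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
score = mean_iou(pred, gt, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```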

Critical Analysis

The paper presents a compelling approach to 3D occupancy prediction that leverages multiple depth cues and self-supervision. However, there are a few potential limitations and areas for further research:

The effectiveness of the contextual self-supervision may depend on the diversity and quality of the training data, which can be challenging to obtain for 3D scenes.
The geometric enhancements introduced in the model may not be generalizable to all types of scenes, and the authors could explore more adaptive or learned geometric priors.
The model's performance on challenging real-world autonomous driving scenarios, with complex dynamic scenes and occlusions, is not fully explored in the current evaluation.

Overall, the GEOcc model represents an interesting and promising step forward in the field of 3D scene understanding for applications like autonomous driving. Further research to address the limitations and expand the model's capabilities could lead to even more robust and reliable 3D perception systems.

Conclusion

The GEOcc model presented in this paper demonstrates a novel approach to 3D occupancy prediction that combines implicit and explicit depth information with contextual self-supervision. By leveraging multiple depth cues and learning robust 3D representations, GEOcc shows improvements over state-of-the-art methods in terms of accuracy and efficiency. While the paper highlights some promising results, there are also opportunities for further research to address the identified limitations and expand the model's capabilities for real-world autonomous driving applications.

Related Papers

Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction

Jingyi Pan, Zipeng Wang, Lin Wang

3D semantic occupancy prediction is a pivotal task in the field of autonomous driving. Recent approaches have made great advances in 3D semantic occupancy predictions on a single modality. However, multi-modal semantic occupancy prediction approaches have encountered difficulties in dealing with the modality heterogeneity, modality misalignment, and insufficient modality interactions that arise during the fusion of different modalities data, which may result in the loss of important geometric and semantic information. This letter presents a novel multi-modal, i.e., LiDAR-camera 3D semantic occupancy prediction framework, dubbed Co-Occ, which couples explicit LiDAR-camera feature fusion with implicit volume rendering regularization. The key insight is that volume rendering in the feature space can proficiently bridge the gap between 3D LiDAR sweeps and 2D images while serving as a physical regularization to enhance LiDAR-camera fused volumetric representation. Specifically, we first propose a Geometric- and Semantic-aware Fusion (GSFusion) module to explicitly enhance LiDAR features by incorporating neighboring camera features through a K-nearest neighbors (KNN) search. Then, we employ volume rendering to project the fused feature back to the image planes for reconstructing color and depth maps. These maps are then supervised by input images from the camera and depth estimations derived from LiDAR, respectively. Extensive experiments on the popular nuScenes and SemanticKITTI benchmarks verify the effectiveness of our Co-Occ for 3D semantic occupancy prediction. The project page is available at https://rorisis.github.io/Co-Occ_project-page/.

5/24/2024

cs.CV

Real-time 3D semantic occupancy prediction for autonomous vehicles using memory-efficient sparse convolution

Samuel Sze, Lars Kunze

In autonomous vehicles, understanding the surrounding 3D environment of the ego vehicle in real-time is essential. A compact way to represent scenes while encoding geometric distances and semantic object information is via 3D semantic occupancy maps. State of the art 3D mapping methods leverage transformers with cross-attention mechanisms to elevate 2D vision-centric camera features into the 3D domain. However, these methods encounter significant challenges in real-time applications due to their high computational demands during inference. This limitation is particularly problematic in autonomous vehicles, where GPU resources must be shared with other tasks such as localization and planning. In this paper, we introduce an approach that extracts features from front-view 2D camera images and LiDAR scans, then employs a sparse convolution network (Minkowski Engine), for 3D semantic occupancy prediction. Given that outdoor scenes in autonomous driving scenarios are inherently sparse, the utilization of sparse convolution is particularly apt. By jointly solving the problems of 3D scene completion of sparse scenes and 3D semantic segmentation, we provide a more efficient learning framework suitable for real-time applications in autonomous vehicles. We also demonstrate competitive accuracy on the nuScenes dataset.

5/21/2024

cs.RO cs.CV

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

Guoqing Wang, Zhongdao Wang, Pin Tang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma

Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation perception problem. These discriminative methods focus on learning the mapping between the inputs and occupancy map in a single step, lacking the ability to gradually refine the occupancy map and the reasonable scene imaginative capacity to complete the local regions somewhere. In this paper, we introduce OccGen, a simple yet powerful generative perception model for the task of 3D semantic occupancy prediction. OccGen adopts a "noise-to-occupancy" generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods. For instance, OccGen relatively enhances the mIoU by 9.5%, 6.3%, and 13.3% on nuScenes-Occupancy dataset under the multi-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions.

4/24/2024

cs.CV

EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network

Yining Shi, Kun Jiang, Ke Wang, Kangan Qian, Yunlong Wang, Jiusi Li, Tuopu Wen, Mengmeng Yang, Yiliang Xu, Diange Yang

3D occupancy prediction (Occ) is a rapidly rising challenging perception task in the field of autonomous driving which represents the driving scene as uniformly partitioned 3D voxel grids with semantics. Compared to 3D object detection, grid perception has the great advantage of better recognizing irregularly shaped, unknown category, or partially occluded general objects. However, existing 3D occupancy networks (occnets) are both computationally heavy and label-hungry. In terms of model complexity, occnets are commonly composed of heavy Conv3D modules or transformers on the voxel level. In terms of label annotation requirements, occnets are supervised with large-scale expensive dense voxel labels. Model and data inefficiency, caused by excessive network parameters and label annotation requirements, severely hinders the onboard deployment of occnets. This paper proposes an efficient 3D occupancy network (EFFOcc) that targets the minimal network complexity and label requirement while achieving state-of-the-art accuracy. EFFOcc only uses simple 2D operators, and improves Occ accuracy to the state-of-the-art on multiple large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, EFFOcc has only 18.4M parameters and achieves 50.46 in terms of mean IoU (mIoU); to our knowledge, it is the occnet with minimal parameters compared with related occnets. Moreover, we propose a two-stage active learning strategy to reduce the requirements of labelled data. Active EFFOcc trained with 6% labelled voxels achieves 47.19 mIoU, which is 95.7% of the fully supervised performance. The proposed EFFOcc also supports improved vision-only occupancy prediction with the aid of region-decomposed distillation. Code and demo videos will be available at https://github.com/synsin0/EFFOcc.

6/12/2024

cs.CV