OccFusion: A Straightforward and Effective Multi-Sensor Fusion Framework for 3D Occupancy Prediction

2403.01644

Published 5/10/2024 by Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

Abstract

A comprehensive understanding of 3D scenes is crucial in autonomous vehicles (AVs), and recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, existing methods for 3D occupancy prediction heavily rely on surround-view camera images, making them susceptible to changes in lighting and weather conditions. This paper introduces OccFusion, a novel sensor fusion framework for predicting 3D occupancy. By integrating features from additional sensors, such as lidar and surround-view radars, our framework enhances the accuracy and robustness of occupancy prediction, resulting in top-tier performance on the nuScenes benchmark. Furthermore, extensive experiments conducted on the nuScenes and SemanticKITTI datasets, including challenging night and rainy scenarios, confirm the superior performance of our sensor fusion strategy across various perception ranges. The code for this framework will be made available at https://github.com/DanielMing123/OccFusion.

Overview

This paper presents OccFusion, a straightforward and effective multi-sensor fusion framework for 3D occupancy prediction.
The goal is to enable robust environment perception for autonomous driving and other applications by fusing data from multiple sensors.
OccFusion combines different sensor modalities, including cameras, LiDAR, and radar, to generate a comprehensive 3D occupancy map of the environment.

Plain English Explanation

OccFusion is a new system that combines information from different types of sensors to create a detailed 3D map of the surroundings. This can be very useful for self-driving cars and other autonomous systems that need to understand their environment.

Autonomous vehicles rely on sensors like cameras, laser scanners (LiDAR), and radar to perceive their surroundings. Each sensor has its own strengths and weaknesses. For example, cameras provide rich visual information but can be affected by lighting conditions, while LiDAR is great at measuring distances but doesn't capture color details.

By fusing the data from multiple sensors, OccFusion can create a more comprehensive and reliable 3D occupancy map of the environment. This map shows which areas are occupied by objects and which are free space that the vehicle can navigate through.
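To make the idea of an occupancy map concrete, here is a minimal sketch in Python. It is not taken from the paper: the voxel size, the grid shape, the sample points, and the helper names voxelize and is_free are all assumptions chosen for illustration. It shows only the data structure (a grid of occupied and free cells), not how OccFusion fills it in.

```python
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize(
    points: Iterable[Point],
    voxel_size: float = 0.5,        # assumed voxel edge length in metres
    grid_shape: Voxel = (4, 4, 2),  # assumed number of voxels along x, y, z
) -> Set[Voxel]:
    """Return the indices of voxels that contain at least one point."""
    occupied: Set[Voxel] = set()
    for x, y, z in points:
        idx = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        # Ignore points that fall outside the grid.
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            occupied.add(idx)
    return occupied


def is_free(voxel: Voxel, occupied: Set[Voxel]) -> bool:
    """Simplification: treat every voxel without a point as free space."""
    return voxel not in occupied


# Hypothetical points (x, y, z) in metres, e.g. returns from a laser scanner.
points = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.7, 0.2, 0.3), (5.0, 0.0, 0.0)]
occupied = voxelize(points)
print(sorted(occupied))              # [(0, 0, 0), (1, 0, 0)]
print(is_free((2, 0, 0), occupied))  # True
```

The last point lies outside the assumed grid and is dropped. A real system would also distinguish "free" from "not observed", which this sketch does not attempt.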

The key innovation in OccFusion is its straightforward and effective approach to combining the sensor data. Rather than using complex deep learning models, it relies on simpler, more interpretable techniques. This makes the system more robust and easier to understand than some previous multi-sensor fusion methods.

Overall, OccFusion aims to enable more reliable environment perception for autonomous driving and other applications that require a thorough understanding of the 3D world.

Technical Explanation

OccFusion is designed to fuse data from various sensors, including cameras, LiDAR, and radar, to generate an accurate 3D occupancy map of the environment. Unlike some previous multi-sensor fusion approaches that rely on complex deep learning models, OccFusion uses a more straightforward and interpretable technique.

The system first processes the raw sensor data to extract relevant features. For example, it uses computer vision algorithms to detect objects in camera images and LiDAR-based methods to estimate the 3D shape and position of obstacles. It then fuses these features using a probabilistic occupancy mapping approach.

The key innovation in OccFusion is its use of explicit feature fusion rather than end-to-end deep learning. This allows the system to maintain more control over the fusion process and make it more transparent and interpretable.
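As a rough illustration of what fusing features explicitly can look like, the sketch below averages per-voxel feature vectors from several sensors using fixed weights. This is not OccFusion's actual fusion rule, which this summary does not spell out. The function name fuse_features, the weights, and the toy feature values are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
FeatureMap = Dict[Voxel, List[float]]


def fuse_features(
    sensor_maps: Dict[str, FeatureMap],
    weights: Dict[str, float],  # hypothetical per-sensor trust weights
) -> FeatureMap:
    """Weighted average of the feature vectors each sensor gives for a voxel."""
    sums: Dict[Voxel, List[float]] = {}
    totals: Dict[Voxel, float] = {}
    for name, fmap in sensor_maps.items():
        w = weights[name]
        for voxel, feat in fmap.items():
            acc = sums.setdefault(voxel, [0.0] * len(feat))
            for i, value in enumerate(feat):
                acc[i] += w * value
            totals[voxel] = totals.get(voxel, 0.0) + w
    # Normalize by the total weight of the sensors that observed each voxel.
    return {v: [s / totals[v] for s in acc] for v, acc in sums.items()}


# Toy two-dimensional features; the values carry no real meaning.
camera: FeatureMap = {(0, 0, 0): [1.0, 0.0]}
lidar: FeatureMap = {(0, 0, 0): [0.0, 1.0], (1, 0, 0): [2.0, 2.0]}

fused = fuse_features(
    {"camera": camera, "lidar": lidar},
    {"camera": 1.0, "lidar": 3.0},
)
print(fused)  # {(0, 0, 0): [0.25, 0.75], (1, 0, 0): [2.0, 2.0]}
```

Because each voxel is normalized by the weights of the sensors that actually observed it, a voxel seen by only one sensor still receives a usable feature.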

In experiments, OccFusion demonstrated strong performance in predicting 3D occupancy, reaching top-tier results on the nuScenes benchmark, and outperformed several deep learning-based baselines. The authors attribute this to OccFusion's ability to effectively leverage the complementary strengths of different sensor modalities.

Critical Analysis

One potential limitation of OccFusion is that it may not be as flexible or adaptive as deep learning-based approaches, which can potentially learn more complex sensor fusion strategies from data. The authors acknowledge that future work could explore ways to incorporate more advanced machine learning techniques while maintaining the system's interpretability.

Additionally, although the experiments include challenging night and rainy scenarios, the paper does not appear to provide a detailed analysis of OccFusion's robustness to outright sensor failures. Further research may be needed to understand the system's behavior when one or more sensors degrade or drop out.

Overall, OccFusion presents a promising approach to multi-sensor fusion for 3D occupancy prediction, with a focus on simplicity, interpretability, and effective leveraging of complementary sensor information. As autonomous systems continue to advance, techniques like OccFusion could play an important role in enabling robust and reliable environment perception.

Conclusion

The OccFusion framework offers a straightforward and effective way to combine data from multiple sensors, such as cameras, LiDAR, and radar, to generate a comprehensive 3D map of the environment. By fusing the complementary strengths of these sensor modalities using an explicit feature fusion approach, OccFusion can provide more reliable and interpretable occupancy prediction for autonomous driving and other applications.

While the paper suggests that OccFusion outperforms some deep learning-based baselines, further research may be needed to assess its robustness and adaptability in more challenging real-world conditions. Nevertheless, the core ideas behind OccFusion, such as its focus on simplicity and interpretability, represent an important step forward in the development of robust multi-sensor fusion for autonomous systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction

Jingyi Pan, Zipeng Wang, Lin Wang

3D semantic occupancy prediction is a pivotal task in the field of autonomous driving. Recent approaches have made great advances in 3D semantic occupancy predictions on a single modality. However, multi-modal semantic occupancy prediction approaches have encountered difficulties in dealing with the modality heterogeneity, modality misalignment, and insufficient modality interactions that arise during the fusion of data from different modalities, which may result in the loss of important geometric and semantic information. This letter presents a novel multi-modal, i.e., LiDAR-camera 3D semantic occupancy prediction framework, dubbed Co-Occ, which couples explicit LiDAR-camera feature fusion with implicit volume rendering regularization. The key insight is that volume rendering in the feature space can proficiently bridge the gap between 3D LiDAR sweeps and 2D images while serving as a physical regularization to enhance LiDAR-camera fused volumetric representation. Specifically, we first propose a Geometric- and Semantic-aware Fusion (GSFusion) module to explicitly enhance LiDAR features by incorporating neighboring camera features through a K-nearest neighbors (KNN) search. Then, we employ volume rendering to project the fused feature back to the image planes for reconstructing color and depth maps. These maps are then supervised by input images from the camera and depth estimations derived from LiDAR, respectively. Extensive experiments on the popular nuScenes and SemanticKITTI benchmarks verify the effectiveness of our Co-Occ for 3D semantic occupancy prediction. The project page is available at https://rorisis.github.io/Co-Occ_project-page/.

5/24/2024

cs.CV

RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar

Fangqiang Ding, Xiangyu Wen, Lawrence Zhu, Yiming Li, Chris Xiaoxuan Lu

The 3D occupancy-based perception pipeline has significantly advanced autonomous driving by capturing detailed scene descriptions and demonstrating strong generalizability across various object categories and shapes. Current methods predominantly rely on LiDAR or camera inputs for 3D occupancy prediction. These methods are susceptible to adverse weather conditions, limiting the all-weather deployment of self-driving cars. To improve perception robustness, we leverage the recent advances in automotive radars and introduce a novel approach that utilizes 4D imaging radar sensors for 3D occupancy prediction. Our method, RadarOcc, circumvents the limitations of sparse radar point clouds by directly processing the 4D radar tensor, thus preserving essential scene details. RadarOcc innovatively addresses the challenges associated with the voluminous and noisy 4D radar data by employing Doppler bins descriptors, sidelobe-aware spatial sparsification, and range-wise self-attention mechanisms. To minimize the interpolation errors associated with direct coordinate transformations, we also devise a spherical-based feature encoding followed by spherical-to-Cartesian feature aggregation. We benchmark various baseline methods based on distinct modalities on the public K-Radar dataset. The results demonstrate RadarOcc's state-of-the-art performance in radar-based 3D occupancy prediction and promising results even when compared with LiDAR- or camera-based methods. Additionally, we present qualitative evidence of the superior performance of 4D radar in adverse weather conditions and explore the impact of key pipeline components through ablation studies.

6/14/2024

cs.CV cs.AI cs.LG cs.RO

EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network

Yining Shi, Kun Jiang, Ke Wang, Kangan Qian, Yunlong Wang, Jiusi Li, Tuopu Wen, Mengmeng Yang, Yiliang Xu, Diange Yang

3D occupancy prediction (Occ) is a rapidly rising and challenging perception task in the field of autonomous driving which represents the driving scene as uniformly partitioned 3D voxel grids with semantics. Compared to 3D object detection, grid perception has the great advantage of better recognizing irregularly shaped, unknown category, or partially occluded general objects. However, existing 3D occupancy networks (occnets) are both computationally heavy and label-hungry. In terms of model complexity, occnets are commonly composed of heavy Conv3D modules or transformers on the voxel level. In terms of label annotation requirements, occnets are supervised with large-scale expensive dense voxel labels. Model and data inefficiency, caused by excessive network parameters and label annotation requirements, severely hinder the onboard deployment of occnets. This paper proposes an efficient 3D occupancy network (EFFOcc) that targets the minimal network complexity and label requirement while achieving state-of-the-art accuracy. EFFOcc only uses simple 2D operators, and improves Occ accuracy to the state-of-the-art on multiple large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, EFFOcc has only 18.4M parameters and achieves 50.46 in terms of mean IoU (mIoU); to our knowledge, it is the occnet with minimal parameters compared with related occnets. Moreover, we propose a two-stage active learning strategy to reduce the requirements of labelled data. Active EFFOcc trained with 6% labelled voxels achieves 47.19 mIoU, which is 95.7% of the fully supervised performance. The proposed EFFOcc also supports improved vision-only occupancy prediction with the aid of region-decomposed distillation. Code and demo videos will be available at https://github.com/synsin0/EFFOcc.

6/12/2024

cs.CV

GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision

Xin Tan, Wenbin Wu, Zhiwei Zhang, Chaojie Fan, Yong Peng, Zhizhong Zhang, Yuan Xie, Lizhuang Ma

3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the lack of generalizability issues due to sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is three-fold: 1) integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation; 2) utilization of a mask-based encoder-decoder architecture for fine-grained semantic predictions; 3) adoption of context-aware self-training loss functions in the pretraining stage to complement LiDAR supervision, involving the re-rendering of 2D depth maps from 3D occupancy features and leveraging image reconstruction loss to obtain denser depth supervision besides sparse LiDAR ground-truths. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the least image resolution needed and the most lightweight image backbone compared with current models, marking an improvement of 3.3% due to our proposed contributions. Comprehensive experimentation also demonstrates the consistent superiority of our method over baselines and alternative approaches.

5/20/2024

cs.CV