MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Read original: arXiv:2408.09122 - Published 8/20/2024 by Xiao Zhao, Xukun Zhang, Dingkang Yang, Mingyang Sun, Mingcheng Li, Shunli Wang, Lihua Zhang

Overview

This paper proposes MaskBEV, a unified framework for 3D object detection and bird's-eye-view (BEV) map segmentation in autonomous driving.
MaskBEV treats the two as a multi-task learning problem, using masked attention and a single task-agnostic Transformer decoder instead of separately designed task heads.
The key idea is to predict 3D object bounding boxes and a BEV semantic map jointly from shared features, so that each task can benefit from the other.

Plain English Explanation

MaskBEV is a new AI system designed for self-driving cars. It aims to address two important tasks: detecting objects (like other vehicles, pedestrians, etc.) and mapping the surrounding environment from a bird's eye view.

Typically, these two tasks - object detection and map segmentation - are tackled separately by different AI models. However, MaskBEV takes a more unified approach, using a single neural network to perform both tasks simultaneously. The model learns shared visual features that are useful for both object detection and map segmentation, allowing it to be more efficient and effective.

By combining these two key capabilities into a single framework, MaskBEV can provide self-driving cars with a more comprehensive and integrated understanding of their surroundings. This could lead to improved safety and navigation for autonomous vehicles.

Technical Explanation

The key innovation of MaskBEV is a shared network that jointly predicts 3D object bounding boxes and a semantic BEV map. According to the paper's abstract, it does this with a masked attention-based, task-agnostic Transformer decoder, so both tasks draw on common features rather than learning them independently.

The network first encodes the input data (e.g. camera images, LiDAR scans) into a shared feature representation. Rather than branching into separately designed task heads, the abstract describes a single task-agnostic Transformer decoder that handles both tasks: it predicts the 3D bounding boxes of objects and outputs a dense, per-cell semantic map of the surrounding environment from a bird's-eye view. Spatial modulation and scene-level context aggregation strategies let the two tasks exchange complementary information in BEV space.
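The paper's abstract describes MaskBEV as masked attention-based. The following is a minimal sketch of generic masked attention for a single query over a handful of BEV cells, not the authors' implementation. The function name `masked_attention`, the toy vectors, and the mask are all assumptions made for illustration. In Mask2Former-style decoders the mask typically comes from the previous layer's predicted segmentation; whether MaskBEV builds its mask the same way is not stated in this summary.

```python
import math


def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention for one query, restricted by a boolean mask.

    query:  list of d floats
    keys:   list of N vectors of length d (one per BEV cell)
    values: list of N vectors (one per BEV cell)
    mask:   list of N bools; False means the cell is ignored
    """
    d = len(query)
    scores = []
    for key, keep in zip(keys, mask):
        if keep:
            dot = sum(q * k for q, k in zip(query, key))
            scores.append(dot / math.sqrt(d))
        else:
            # Masked cells receive zero weight after the softmax.
            scores.append(float("-inf"))

    top = max(scores)
    if top == float("-inf"):
        raise ValueError("mask must keep at least one cell")

    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


# Toy example: four BEV cells, of which only cells 0 and 2 are visible to the query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[10.0], [20.0], [30.0], [40.0]]
mask = [True, False, True, False]
out, weights = masked_attention([1.0, 0.0], keys, values, mask)
```

Here the two visible cells have equal scores, so the query averages their values and ignores the masked cells entirely. Restricting each query to a region in this way is what lets one decoder serve both object-like queries and map-region queries.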

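By optimizing for both tasks simultaneously through multi-task learning, MaskBEV reports gains over previous state-of-the-art multi-task methods on the nuScenes dataset: 1.3 NDS in 3D object detection and 2.7 mIoU in BEV map segmentation, with slightly faster inference. The abstract notes that joint training in existing approaches tends to decrease performance, and attributes MaskBEV's gains to exploiting the dependencies between BEV segmentation and 3D detection.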
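Training one model on two tasks is commonly done by minimizing a weighted sum of per-task losses. The sketch below shows that generic pattern under stated assumptions: the L1 box loss, the binary cross-entropy map loss, and the weights `w_det` and `w_seg` are hypothetical choices for illustration, since this summary does not specify the losses or weights MaskBEV actually uses.

```python
import math


def l1_box_loss(pred_boxes, gt_boxes):
    """Mean absolute error over the parameters of already-matched boxes."""
    total, count = 0.0, 0
    for pred, gt in zip(pred_boxes, gt_boxes):
        for p, g in zip(pred, gt):
            total += abs(p - g)
            count += 1
    return total / count


def bce_map_loss(pred_probs, gt_labels, eps=1e-7):
    """Mean binary cross-entropy over BEV cells for a single map class."""
    total = 0.0
    for p, y in zip(pred_probs, gt_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(pred_probs)


def joint_loss(det_loss, seg_loss, w_det=1.0, w_seg=1.0):
    """Weighted sum of the two task losses (weights are hypothetical)."""
    return w_det * det_loss + w_seg * seg_loss


# Toy example: one box with three parameters and a two-cell BEV map.
det = l1_box_loss([[1.0, 2.0, 0.5]], [[1.5, 2.0, 0.5]])
seg = bce_map_loss([0.9, 0.2], [1, 0])
total = joint_loss(det, seg, w_det=2.0, w_seg=1.0)
```

Because both terms backpropagate into the same shared features, the choice of weights decides how much each task shapes the representation. That balance is one of the trade-offs of joint training.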

Critical Analysis

The authors acknowledge that MaskBEV, like other BEV perception models, relies on accurate camera-lidar calibration and registration, which can be challenging in practice. Imperfect sensor fusion could degrade the model's performance.

Additionally, the paper does not explore the model's robustness to sensor failures or occlusions, which are common issues in real-world autonomous driving scenarios. Further research is needed to assess the system's reliability in complex, dynamic environments.

While the unified framework is an interesting idea, the authors do not provide a thorough analysis of the trade-offs between the joint learning approach and training separate models for each task. The relative benefits and limitations of this multi-task strategy could be explored in more depth.

Conclusion

MaskBEV presents a novel unified framework for integrating object detection and BEV map segmentation, two crucial capabilities for autonomous driving. By leveraging shared visual features, the model is able to achieve strong performance on both tasks simultaneously.

This type of integrated perception system could be an important step towards building more capable and reliable self-driving vehicles. However, further research is needed to fully understand the practical limitations and robustness of the approach in real-world conditions.

Related Papers

MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Xiao Zhao, Xukun Zhang, Dingkang Yang, Mingyang Sun, Mingcheng Li, Shunli Wang, Lihua Zhang

Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird's eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on nuScenes dataset show that compared with previous state-of-the-art MTL methods, MaskBEV achieves 1.3 NDS improvement in 3D object detection and 2.7 mIoU improvement in BEV map segmentation, while also demonstrating slightly leading inference speed.

8/20/2024

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han

Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.

9/4/2024

UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities

Shiming Wang, Holger Caesar, Liangliang Nan, Julian F. P. Kooij

Multi-sensor object detection is an active research topic in automated driving, but the robustness of such detection models against missing sensor input (modality missing), e.g., due to a sudden sensor failure, is a critical problem which remains under-studied. In this work, we propose UniBEV, an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities: UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining. To facilitate its detector head to handle different input combinations, UniBEV aims to create well-aligned Bird's Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, all sensor modalities follow a uniform approach to resample features from the native sensor coordinate systems to the BEV features. We furthermore investigate the robustness of various fusion strategies w.r.t. missing modalities: the commonly used feature concatenation, but also channel-wise averaging, and a generalization to weighted averaging termed Channel Normalized Weights. To validate its effectiveness, we compare UniBEV to state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations. In this setting, UniBEV achieves 52.5% mAP on average over all input combinations, significantly improving over the baselines (43.5% mAP on average for BEVFusion, 48.7% mAP on average for MetaBEV). An ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the BEV encoders of each modality. Our code is available at https://github.com/tudelft-iv/UniBEV.

5/9/2024

OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation

Jian Sun, Yuqi Dai, Chi-Man Vong, Qing Xu, Shengbo Eben Li, Jianqiang Wang, Lei He, Keqiang Li

Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems. It realizes ego-vehicle surrounding environment perception by projecting 2D multi-view images into 3D world space. Recently, BEV segmentation has made notable progress, attributed to better view transformation modules, larger image encoders, or more temporal information. However, there are still two issues: 1) a lack of effective understanding and enhancement of BEV space features, particularly in accurately capturing long-distance environmental features and 2) recognizing fine details of target objects. To address these issues, we propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance through global environment-aware perception and local target object enhancement. OE-BevSeg employs an environment-aware BEV compressor. Based on prior knowledge about the main composition of the BEV surrounding environment varying with the increase of distance intervals, long-sequence global modeling is utilized to improve the model's understanding and perception of the environment. From the perspective of enriching target object information in segmentation results, we introduce the center-informed object enhancement module, using centerness information to supervise and guide the segmentation head, thereby enhancing segmentation performance from a local enhancement perspective. Additionally, we designed a multimodal fusion branch that integrates multi-view RGB image features with radar/LiDAR features, achieving significant performance improvements. Extensive experiments show that, whether in camera-only or multimodal fusion BEV segmentation tasks, our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation, demonstrating superior applicability in the field of autonomous driving.

7/19/2024