BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Read original: arXiv:2406.08785 - Published 6/14/2024 by Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li

Overview

This paper presents BEVSpread, a voxel pooling strategy for building bird's-eye-view (BEV) representations in vision-based roadside 3D object detection.
BEVSpread introduces "spread voxel pooling": instead of assigning the image features of a frustum point to a single BEV grid cell, it spreads them to the surrounding cells with adaptive weights, reducing the position approximation error of standard voxel pooling.
The method is evaluated on two large-scale roadside benchmarks, where, used as a plug-in, it improves existing frustum-based BEV methods by margins of (1.12, 5.26, 3.01) AP on vehicle, pedestrian, and cyclist.

Plain English Explanation

The paper discusses a new way to create a top-down, or "bird's-eye-view" (BEV), representation from camera images for 3D object detection. Here the cameras are mounted at the roadside rather than on a vehicle, which, as the authors note, reduces blind spots and expands perception range for autonomous driving.

The key innovation in this work is a "spread voxel pooling" technique. When image features are lifted into 3D and dropped onto the BEV grid, each 3D point normally lands in exactly one grid cell, even if it sits near the cell's edge, so its position is effectively rounded off. BEVSpread instead shares each point's features among the nearby cells, giving more weight to closer ones. This reduces the rounding error and makes the BEV representation more accurate.

The researchers tested BEVSpread on two large-scale roadside 3D object detection benchmarks. Used as a plug-in for existing frustum-based BEV methods, it improved their results by margins of 1.12, 5.26, and 3.01 AP for vehicles, pedestrians, and cyclists, which suggests it is a promising technique for vision-based roadside perception.

Technical Explanation

The paper introduces "BEVSpread", a voxel pooling strategy for generating bird's-eye-view (BEV) representations from roadside camera images for 3D object detection. Whereas previous work mainly focuses on accurately estimating depth or height for the 2D-to-3D mapping, BEVSpread targets the position approximation error introduced in the voxel pooling step.

Specifically, frustum-based methods lift image features to frustum points in 3D and then accumulate the features of each point into a single BEV grid cell, which discards the point's exact position within that cell. BEVSpread instead treats each frustum point as a source and spreads its image features to the surrounding BEV grid cells with adaptive weights. A dedicated weight function controls how quickly the weights decay according to distance and depth, and customized CUDA parallel acceleration keeps inference time comparable to the original voxel pooling.
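
As described in the paper's abstract, the core change is where a frustum point's feature ends up on the BEV grid. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It contrasts conventional pooling, where each point's feature goes to the single cell containing it, with a spread variant that distributes the feature over neighboring cells. The function names, the square neighborhood (`radius`), and the fixed Gaussian weights (`sigma`) are illustrative assumptions; the paper uses adaptive weights and a custom CUDA kernel, neither of which is reproduced here.

```python
import numpy as np


def single_cell_pooling(points_xy, feats, grid_shape, cell_size):
    """Baseline: add each point's feature to the one BEV cell that contains it."""
    bev = np.zeros((*grid_shape, feats.shape[1]))
    idx = np.floor(points_xy / cell_size).astype(int)
    for (ix, iy), f in zip(idx, feats):
        if 0 <= ix < grid_shape[0] and 0 <= iy < grid_shape[1]:
            bev[ix, iy] += f
    return bev


def spread_pooling(points_xy, feats, grid_shape, cell_size, radius=1, sigma=0.5):
    """Spread variant (assumed form): share each feature among nearby cells.

    Weights fall off with the distance between the point and each cell center
    and are normalized per point so that they sum to 1.
    """
    bev = np.zeros((*grid_shape, feats.shape[1]))
    idx = np.floor(points_xy / cell_size).astype(int)
    offsets = [(dx, dy)
               for dx in range(-radius, radius + 1)
               for dy in range(-radius, radius + 1)]
    for p, (ix, iy), f in zip(points_xy, idx, feats):
        cells, weights = [], []
        for dx, dy in offsets:
            cx, cy = ix + dx, iy + dy
            if 0 <= cx < grid_shape[0] and 0 <= cy < grid_shape[1]:
                center = (np.array([cx, cy]) + 0.5) * cell_size
                dist = np.linalg.norm(p - center)
                cells.append((cx, cy))
                weights.append(np.exp(-dist ** 2 / (2.0 * sigma ** 2)))
        if not cells:
            continue  # point and all of its neighbors lie outside the grid
        weights = np.array(weights) / np.sum(weights)
        for (cx, cy), w in zip(cells, weights):
            bev[cx, cy] += w * f
    return bev


# A point close to the corner of cell (1, 1) on a 4 x 4 grid with 1 m cells.
points = np.array([[1.9, 1.9]])
features = np.array([[2.0, 4.0]])
hard = single_cell_pooling(points, features, (4, 4), 1.0)
soft = spread_pooling(points, features, (4, 4), 1.0)
print(hard[:, :, 0])
print(soft[:, :, 0].round(3))
```

Because the weights are normalized per point, the total feature mass on the grid is the same in both cases for points inside the grid; the two functions differ only in where the features land. In the baseline, the point at (1.9, 1.9) is indistinguishable from one at the center of cell (1, 1), whereas the spread version also deposits part of the feature in the adjacent cells it nearly touches.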

BEVSpread is evaluated on two large-scale roadside 3D object detection benchmarks. Used as a plug-in, it improves existing frustum-based BEV methods by margins of (1.12, 5.26, 3.01) AP on vehicle, pedestrian, and cyclist, which indicates that reducing the position approximation error in voxel pooling yields a more accurate BEV representation.

Critical Analysis

The paper provides a compelling technical approach for building more accurate BEV representations from roadside camera images. A key strength of BEVSpread is that it works as a plug-in: it replaces the voxel pooling step of existing frustum-based BEV methods rather than requiring a new detector.

However, the paper does not deeply explore the limitations of the proposed method. For example, it is unclear how BEVSpread would handle significant occlusions or missing data. Additionally, although the authors report inference time comparable to the original voxel pooling thanks to customized CUDA parallel acceleration, the memory requirements of the spread voxel pooling operation are not analyzed in depth.

Furthermore, while the results on the two roadside benchmarks are strong, it would be valuable to see evaluations on a broader range of 3D perception tasks and real-world scenarios. This could help validate the general applicability of the BEVSpread approach.

Overall, this is a technically solid contribution to the field of vision-based 3D object detection. However, further research is needed to fully understand the limitations and generalization capabilities of the proposed method.

Conclusion

The BEVSpread paper introduces a voxel pooling strategy for generating bird's-eye-view representations from roadside camera images for 3D object detection. The key innovation is a "spread voxel pooling" approach that spreads the image features of each frustum point to the surrounding BEV grid cells with adaptive weights, reducing the position approximation error of conventional voxel pooling.

Experimental results on two large-scale roadside 3D detection benchmarks show that BEVSpread, used as a plug-in, improves the performance of existing frustum-based BEV methods. This suggests the proposed approach could be a valuable tool for vision-based roadside perception in support of autonomous driving.

While further research is needed to fully understand the limitations and broader applicability of BEVSpread, this work represents a useful advance in vision-based roadside 3D object detection. The spread voxel pooling technique provides a promising direction for building more accurate BEV representations from camera data.

Related Papers

BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li

Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. Previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves inference time comparable to the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP in vehicle, pedestrian and cyclist.

6/14/2024
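
The BEVSpread abstract above says only that the weight function controls the decay speed of the weights according to distance and depth; it does not give the formula. The Python sketch below is one hypothetical instance, included to show what such a dependence could look like. The names `spread_weight` and `normalized_weights`, the parameters `alpha` and `sigma_min`, the Gaussian shape, and the choice that deeper points decay more slowly are all assumptions made for illustration, not the paper's definition.

```python
import numpy as np


def spread_weight(distance, depth, alpha=0.05, sigma_min=0.2):
    """Hypothetical weight: Gaussian decay in distance, with a width that grows with depth.

    distance: distance from the frustum point to a BEV cell center (meters)
    depth:    depth of the frustum point from the camera (meters)
    """
    sigma = sigma_min + alpha * depth  # assumed: farther points spread more widely
    return np.exp(-distance ** 2 / (2.0 * sigma ** 2))


def normalized_weights(distances, depth, **kwargs):
    """Weights over one point's neighboring cells, scaled to sum to 1."""
    w = spread_weight(np.asarray(distances, dtype=float), depth, **kwargs)
    return w / w.sum()


# The same three neighbor distances, for a near point and a far point.
neighbor_distances = [0.2, 0.8, 1.1]
print(normalized_weights(neighbor_distances, depth=5.0))
print(normalized_weights(neighbor_distances, depth=80.0))
```

Under these assumptions, a near point concentrates almost all of its weight on the closest cell, while a far point distributes its weight more evenly across the neighborhood. Whether the paper's function behaves in this direction should be checked against the original text.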

PointBeV: A Sparse Approach to BeV Predictions

Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord

Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at https://github.com/valeoai/PointBeV.

5/24/2024

GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection

Jinqing Zhang, Yanan Zhang, Yunlong Qi, Zehua Fu, Qingjie Liu, Yunhong Wang

Bird's-Eye-View (BEV) representation has emerged as a mainstream paradigm for multi-view 3D object detection, demonstrating impressive perceptual capabilities. However, existing methods overlook the geometric quality of BEV representation, leaving it in a low-resolution state and failing to restore the authentic geometric information of the scene. In this paper, we identify the reasons why previous approaches are constrained by low BEV representation resolution and propose Radial-Cartesian BEV Sampling (RC-Sampling), enabling efficient generation of high-resolution dense BEV representations without the need for complex operators. Additionally, we design a novel In-Box Label to substitute the traditional depth label generated from the LiDAR points. This label reflects the actual geometric structure of objects rather than just their surfaces, injecting real-world geometric information into the BEV representation. Furthermore, in conjunction with the In-Box Label, a Centroid-Aware Inner Loss (CAI Loss) is developed to capture the fine-grained inner geometric structure of objects. Finally, we integrate the aforementioned modules into a novel multi-view 3D object detection framework, dubbed GeoBEV. Extensive experiments on the nuScenes dataset exhibit that GeoBEV achieves state-of-the-art performance, highlighting its effectiveness.

9/4/2024

Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline

Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, Jing Shao

Recently, perception tasks based on Bird's-Eye View (BEV) representation have drawn more and more attention, and BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on the on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer-based transformation or depth representation. Our Fast-BEV consists of five parts. We propose (1) a lightweight deployment-friendly view transformation which fast transfers 2D image feature to 3D voxel space, (2) a multi-scale image encoder which leverages multi-scale information for better performance, (3) an efficient BEV encoder which is particularly designed to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, (5) a multi-frame feature fusion mechanism to leverage the temporal information. Through experiments, on 2080Ti platform, our R50 model can run 52.6 FPS with 47.3% NDS on the nuScenes validation set, exceeding the 41.3 FPS and 47.5% NDS of the BEVDepth-R50 model and 30.2 FPS and 45.7% NDS of the BEVDet4D-R50 model. Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips. The code is released at: https://github.com/Sense-GVT/Fast-BEV.

7/10/2024