PointBeV: A Sparse Approach to BeV Predictions

Read original: arXiv:2312.00703 - Published 5/24/2024 by Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord

Overview

Bird's-eye View (BeV) representations have become the standard in driving applications, providing a unified space for sensor data fusion and supporting various downstream tasks.
Conventional BeV models use grids with a fixed resolution and range, which leads to computational inefficiencies because resources are allocated uniformly across all cells.
To address this, the researchers propose PointBeV, a sparse BeV segmentation model that operates on sparse BeV cells instead of dense grids.

Plain English Explanation

The paper builds on the Bird's-eye View (BeV) representation, a top-down view of the scene around a vehicle that is already widely used in driving applications. This representation provides a unified space for fusing sensor data and supports downstream tasks such as segmenting vehicles, pedestrians, and lanes.

Conventional BeV models use a grid-like structure with a fixed resolution and range, which can be inefficient. These models allocate the same amount of computational power to each cell in the grid, even if some cells are more important than others.

To address this, the researchers developed a new model called PointBeV. Instead of a dense grid, PointBeV operates on a sparse set of cells, focusing computational resources on the most important areas. This gives precise control over memory usage, which makes it practical to process longer temporal contexts and to run on memory-constrained platforms.

PointBeV is trained with an efficient two-pass strategy that focuses computation on regions of interest. At inference time, the model can be adjusted to balance memory usage and performance, making it flexible for different use cases.

The researchers report that PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, even though it is trained solely with sparse signals.

Technical Explanation

The researchers propose PointBeV, a novel sparse BeV segmentation model that operates on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms.
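To make the memory argument concrete, the following minimal sketch compares the number of cells a dense grid must process with the number a sparse model would touch. The 100 m range, the 0.5 m resolution, the rectangular region of interest, and the helper names are illustrative assumptions, not values or APIs taken from the paper.

```python
def dense_cell_count(range_m: float, resolution_m: float) -> int:
    """Number of cells in a square dense BeV grid covering range_m x range_m."""
    cells_per_side = round(range_m / resolution_m)
    return cells_per_side * cells_per_side


def sparse_fraction(active_cells: set, range_m: float, resolution_m: float) -> float:
    """Fraction of the dense grid that a sparse model would actually evaluate."""
    return len(active_cells) / dense_cell_count(range_m, resolution_m)


# Hypothetical scene: a 100 m x 100 m area at 0.5 m per cell.
total = dense_cell_count(100.0, 0.5)

# Hypothetical region of interest: a 20 x 10 cell patch (10 m x 5 m).
active = {(i, j) for i in range(90, 110) for j in range(95, 105)}

print(total)                                # 40000
print(len(active))                          # 200
print(sparse_fraction(active, 100.0, 0.5))  # 0.005
```

In this toy setting the sparse model evaluates 0.5% of the cells a dense grid would, which is the kind of saving that leaves room for longer temporal contexts.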

PointBeV employs an efficient two-pass strategy for training, which enables focused computation on regions of interest. At inference time, the model can be used with various memory/performance trade-offs and can flexibly adjust to new use cases.
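One way to picture a two-pass scheme is sketched below: a coarse pass scores a strided subset of cells, and a fine pass evaluates only the neighbourhoods of the highest-scoring coarse cells, capped by a cell budget. The stride, threshold, window size, budget, and the `score_fn` stand-in for the network are all assumptions made for illustration; the source only states that two passes are used to focus computation on regions of interest.

```python
from typing import Callable, List, Tuple

Cell = Tuple[int, int]  # (row, col) index of a BeV cell


def coarse_pass(grid_size: int, stride: int) -> List[Cell]:
    """First pass: a regular, strided subset of cells (assumed sampling pattern)."""
    return [(r, c)
            for r in range(0, grid_size, stride)
            for c in range(0, grid_size, stride)]


def fine_pass(anchors: List[Cell], grid_size: int,
              half_window: int, budget: int) -> List[Cell]:
    """Second pass: densify around anchors, never exceeding the cell budget."""
    selected: List[Cell] = []
    seen = set()
    for r, c in anchors:
        for dr in range(-half_window, half_window + 1):
            for dc in range(-half_window, half_window + 1):
                cell = (r + dr, c + dc)
                inside = 0 <= cell[0] < grid_size and 0 <= cell[1] < grid_size
                if not inside or cell in seen:
                    continue
                if len(selected) >= budget:
                    return selected
                seen.add(cell)
                selected.append(cell)
    return selected


def two_pass(score_fn: Callable[[Cell], float], grid_size: int = 200,
             stride: int = 8, threshold: float = 0.5,
             half_window: int = 4, budget: int = 5000):
    """Run the coarse pass, keep confident cells as anchors, then densify."""
    coarse = coarse_pass(grid_size, stride)
    # Highest-scoring coarse cells come first, so they are densified first.
    scored = sorted(((score_fn(cell), cell) for cell in coarse), reverse=True)
    anchors = [cell for score, cell in scored if score > threshold]
    fine = fine_pass(anchors, grid_size, half_window, budget)
    return coarse, fine


# Toy stand-in for the model: a single "object" centred on cell (100, 100).
def toy_score(cell: Cell) -> float:
    r, c = cell
    return 1.0 if abs(r - 100) <= 10 and abs(c - 100) <= 10 else 0.0


coarse, fine = two_pass(toy_score)
evaluated = len(set(coarse) | set(fine))
print(len(coarse), len(fine), evaluated, 200 * 200)  # 625 289 910 40000
```

In this toy run about 2% of the 40,000 cells are evaluated. Lowering `budget` (for example `two_pass(toy_score, budget=100)`) trades coverage for memory, which mirrors the kind of memory/performance adjustment described for inference.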

The researchers also introduce two new efficient modules used in the PointBeV architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling.
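The idea of pulling image features only for the BeV points that need them can be illustrated with a small, dependency-free sketch. It assumes a single pinhole camera, points already expressed in the camera frame, a one-channel feature map, and bilinear interpolation. None of these details, nor the function names, come from the paper; the authors' actual module is in their released code.

```python
import math


def project(point, fx, fy, cx, cy):
    """Project a 3D point (camera frame, z forward) to pixel coordinates.

    Returns None when the point lies behind the camera.
    """
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def bilinear(feature_map, u, v):
    """Bilinearly interpolate feature_map[row][col] at continuous pixel (u, v).

    Returns None when (u, v) falls outside the feature map.
    """
    height, width = len(feature_map), len(feature_map[0])
    if not (0 <= u <= width - 1 and 0 <= v <= height - 1):
        return None
    c0, r0 = int(math.floor(u)), int(math.floor(v))
    c1, r1 = min(c0 + 1, width - 1), min(r0 + 1, height - 1)
    du, dv = u - c0, v - r0
    top = (1 - du) * feature_map[r0][c0] + du * feature_map[r0][c1]
    bottom = (1 - du) * feature_map[r1][c0] + du * feature_map[r1][c1]
    return (1 - dv) * top + dv * bottom


def pull_features(points, feature_map, fx, fy, cx, cy):
    """Return {point_index: feature} only for points visible in the image."""
    pulled = {}
    for index, point in enumerate(points):
        pixel = project(point, fx, fy, cx, cy)
        if pixel is None:
            continue  # behind the camera: no work is spent on this point
        value = bilinear(feature_map, *pixel)
        if value is not None:
            pulled[index] = value
    return pulled


# Hypothetical 3 x 3 single-channel feature map and four query points.
feature_map = [[0.0, 1.0, 2.0],
               [3.0, 4.0, 5.0],
               [6.0, 7.0, 8.0]]
points = [
    (0.0, 0.0, 2.0),   # on the optical axis, lands on pixel (1, 1)
    (0.5, 0.0, 1.0),   # lands halfway between pixels (1, 1) and (2, 1)
    (0.0, 0.0, -1.0),  # behind the camera
    (10.0, 0.0, 1.0),  # projects outside the image
]
result = pull_features(points, feature_map, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
print(result)  # {0: 4.0, 1: 4.5}
```

Only the two visible points receive a feature; the other two are skipped entirely, so the cost scales with the number of useful points rather than with the size of a dense grid.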

PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals.

Critical Analysis

The paper provides a comprehensive evaluation of the PointBeV model, including comparisons to other state-of-the-art BeV segmentation approaches. However, the researchers do not discuss potential limitations or areas for further research in depth.

One potential concern is the reliance on the nuScenes dataset, which may not capture the full diversity of real-world driving scenarios. Additional evaluation on other datasets could help validate the model's broader applicability.

Furthermore, inference latency matters as much as memory for real-world deployment on constrained platforms. The authors state that the model supports various memory/performance trade-offs; quantifying how accuracy changes as the memory budget shrinks would give practitioners valuable guidance.

Conclusion

The proposed PointBeV model offers a novel approach to BeV segmentation that addresses the computational inefficiencies of conventional grid-based models. By focusing computational resources on sparse, relevant cells, PointBeV achieves state-of-the-art results on the nuScenes dataset while providing flexible memory and performance trade-offs.

The introduction of efficient modules like Sparse Feature Pulling and Submanifold Attention further highlights the researchers' contributions to the field of BeV representation and segmentation. As autonomous driving systems continue to evolve, models like PointBeV may play a crucial role in enabling more efficient and robust perception capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PointBeV: A Sparse Approach to BeV Predictions

Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord

Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at https://github.com/valeoai/PointBeV.

5/24/2024

Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline

Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, Jing Shao

Recently, perception task based on Bird's-Eye View (BEV) representation has drawn more and more attention, and BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on the on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer based transformation nor depth representation. Our Fast-BEV consists of five parts, We novelly propose (1) a lightweight deployment-friendly view transformation which fast transfers 2D image feature to 3D voxel space, (2) an multi-scale image encoder which leverages multi-scale information for better performance, (3) an efficient BEV encoder which is particularly designed to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, (5) a multi-frame feature fusion mechanism to leverage the temporal information. Through experiments, on 2080Ti platform, our R50 model can run 52.6 FPS with 47.3% NDS on the nuScenes validation set, exceeding the 41.3 FPS and 47.5% NDS of the BEVDepth-R50 model and 30.2 FPS and 45.7% NDS of the BEVDet4D-R50 model. Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips. The code is released at: https://github.com/Sense-GVT/Fast-BEV.

7/10/2024

BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li

Vision-based roadside 3D object detection has attracted rising attention in autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. While previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves comparable inference time as the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP in vehicle, pedestrian and cyclist.

6/14/2024

GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation

Florian Chabot, Nicolas Granger, Guillaume Lapouge

The Bird's-eye View (BeV) representation is widely used for 3D perception from multi-view camera images. It allows to merge features from different cameras into a common space, providing a unified representation of the 3D scene. The key component is the view transformer, which transforms image views into the BeV. However, actual view transformer methods based on geometry or cross-attention do not provide a sufficiently detailed representation of the scene, as they use a sub-sampling of the 3D space that is non-optimal for modeling the fine structures of the environment. In this paper, we propose GaussianBeV, a novel method for transforming image features to BeV by finely representing the scene using a set of 3D gaussians located and oriented in 3D space. This representation is then splattered to produce the BeV feature map by adapting recent advances in 3D representation rendering based on gaussian splatting. GaussianBeV is the first approach to use this 3D gaussian modeling and 3D scene rendering process online, i.e. without optimizing it on a specific scene and directly integrated into a single stage model for BeV scene understanding. Experiments show that the proposed representation is highly effective and place GaussianBeV as the new state-of-the-art on the BeV semantic segmentation task on the nuScenes dataset.

7/22/2024