PillarNeXt: Improving the 3D detector by introducing Voxel2Pillar feature encoding and extracting multi-scale features

2405.09828

Published 5/21/2024 by Xusheng Li, Chengliang Wang, Shumao Wang, Zhuo Zeng, Ji Liu

PillarNeXt: Improving the 3D detector by introducing Voxel2Pillar feature encoding and extracting multi-scale features

Abstract

The multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Extracting rich multi-scale features is crucial for point cloud-based 3D detectors in autonomous driving due to significant differences in the size of different types of objects. However, because of the real-time requirements, large-size convolution kernels are rarely used to extract large-scale features in the backbone. Current 3D detectors commonly use feature pyramid networks to obtain large-scale features; however, some objects containing fewer point clouds are further lost during down-sampling, resulting in degraded performance. Since pillar-based schemes require much less computation than voxel-based schemes, they are more suitable for constructing real-time 3D detectors. Hence, we propose the PillarNeXt, a pillar-based scheme. We redesigned the feature encoding, the backbone, and the neck of the 3D detector. We propose the Voxel2Pillar feature encoding, which uses a sparse convolution constructor to construct pillars with richer point cloud features, especially height features. The Voxel2Pillar adds more learnable parameters to the feature encoding, enabling the initial pillars to have higher performance ability. We extract multi-scale and large-scale features in the proposed fully sparse backbone, which does not utilize large-size convolutional kernels; the backbone consists of the proposed multi-scale feature extraction module. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves the performance. We validate the effectiveness of the proposed PillarNeXt on the Waymo Open Dataset, and the object detection accuracy for vehicles, pedestrians, and cyclists is improved. We also verify the effectiveness of each proposed module in detail through ablation studies.

Create account to get full access

Overview

This paper introduces PillarNeXt, an improved 3D object detector that uses a novel "Voxel2Pillar" feature encoding and extracts multi-scale features for better performance.
The key innovations are a new way to encode voxel-based features into a more efficient "pillar" representation, and the use of multi-scale feature extraction to capture objects at different sizes.
The authors evaluate PillarNeXt on the popular KITTI 3D object detection benchmark and show significant improvements over previous state-of-the-art methods.

Plain English Explanation

The paper describes a new system for detecting 3D objects, such as cars or pedestrians, from sensor data like point clouds. [Link to PVTransformer paper] Current 3D detectors often struggle to accurately identify small or distant objects, so the researchers developed some clever techniques to address this.

First, they introduce a new way to represent the 3D sensor data, called "Voxel2Pillar" encoding. Instead of breaking the space into small 3D cubes (voxels), they use vertical columns (pillars) which are more efficient. This allows the system to focus its processing power on the most relevant parts of the scene.

[Link to PillarTrack paper] Second, the system extracts features at multiple scales - looking for both large and small objects simultaneously. This multi-scale approach helps ensure that details aren't missed, even for far away or partially occluded objects.

The combined innovations of Voxel2Pillar encoding and multi-scale features result in a 3D object detector that outperforms previous methods on standard benchmark tests. This could lead to improved performance for applications like self-driving cars, where accurate 3D perception is crucial.

Technical Explanation

The core technical contribution of this paper is the introduction of "PillarNeXt", a novel 3D object detection architecture that builds upon previous "pillar-based" methods like [Link to PV-RCNN paper] PointPillars and PV-RCNN.

The key innovations are:

Voxel2Pillar Feature Encoding: Instead of using a standard voxel grid to discretize the 3D point cloud, the authors propose a more efficient "pillar" representation. Points within each vertical column (pillar) are aggregated using a PointNet-based encoder, resulting in a compact feature map.
Multi-Scale Feature Extraction: To better handle objects of varying sizes, the authors extract features at multiple scales using a feature pyramid network. This allows the model to capture both large and small objects effectively.

The PillarNeXt network takes the Voxel2Pillar encoded features as input and processes them through a series of 3D convolutional layers and a transformer-based detection head. This architecture enables efficient and accurate 3D object detection, as demonstrated by the authors' experiments on the KITTI benchmark.

Compared to previous state-of-the-art methods, PillarNeXt achieves significant improvements in 3D object detection performance, particularly for small and distant objects. The authors attribute this to the effectiveness of the Voxel2Pillar encoding and the multi-scale feature extraction.

Critical Analysis

The PillarNeXt paper presents a well-designed and thoroughly evaluated 3D object detection system. The authors have carefully considered the limitations of existing methods and introduced innovative solutions to address them.

One potential concern is the computational complexity of the multi-scale feature extraction, which may impact real-time performance in some applications. The authors do not provide detailed comparisons of runtime or model size compared to previous methods.

Additionally, the paper does not explore the generalization of PillarNeXt to other 3D detection datasets or sensors beyond the KITTI benchmark. Further research could investigate the transferability of the proposed techniques to a broader range of scenarios.

[Link to Sparse Points to Dense Clouds paper] Another area for potential improvement is the handling of sparse point clouds, which can be challenging for many 3D detectors. Incorporating techniques like sparse-to-dense point cloud enhancement could further improve the robustness of PillarNeXt.

Overall, the PillarNeXt paper makes a valuable contribution to the field of 3D object detection, demonstrating the benefits of carefully designed feature encoding and multi-scale processing. The system's strong performance on the KITTI benchmark suggests it could be a promising direction for real-world applications.

Conclusion

The PillarNeXt paper presents an innovative 3D object detection system that addresses the limitations of previous methods through two key innovations: Voxel2Pillar feature encoding and multi-scale feature extraction.

[Link to Local Neighborhood Features for 3D Classification paper] By efficiently representing the 3D point cloud and capturing features at multiple scales, PillarNeXt achieves state-of-the-art results on the KITTI 3D object detection benchmark, particularly for small and distant objects. These advancements could lead to significant improvements in 3D perception for applications like autonomous vehicles, where accurate 3D detection is critical for safe navigation.

While the paper could benefit from further exploration of computational efficiency and generalization to other datasets, the core ideas behind PillarNeXt demonstrate the potential of leveraging novel feature representations and multi-scale processing to push the boundaries of 3D object detection. The research community will likely build upon these concepts to drive continued progress in this important field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🔎

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection

Zhaoqi Leng, Pei Sun, Tong He, Dragomir Anguelov, Mingxing Tan

3D object detectors for point clouds often rely on a pooling-based PointNet to encode sparse points into grid-like voxels or pillars. In this paper, we identify that the common PointNet design introduces an information bottleneck that limits 3D object detection accuracy and scalability. To address this limitation, we propose PVTransformer: a transformer-based point-to-voxel architecture for 3D detection. Our key idea is to replace the PointNet pooling operation with an attention module, leading to a better point-to-voxel aggregation function. Our design respects the permutation invariance of sparse 3D points while being more expressive than the pooling-based PointNet. Experimental results show our PVTransformer achieves much better performance compared to the latest 3D object detectors. On the widely used Waymo Open Dataset, our PVTransformer achieves state-of-the-art 76.5 mAPH L2, outperforming the prior art of SWFormer by +1.7 mAPH L2.

5/7/2024

cs.CV

PillarTrack: Redesigning Pillar-based Transformer Network for Single Object Tracking on Point Clouds

Weisheng Xu, Sifan Zhou, Zhihang Yuan

LiDAR-based 3D single object tracking (3D SOT) is a critical issue in robotics and autonomous driving. It aims to obtain accurate 3D BBox from the search area based on similarity or motion. However, existing 3D SOT methods usually follow the point-based pipeline, where the sampling operation inevitably leads to redundant or lost information, resulting in unexpected performance. To address these issues, we propose PillarTrack, a pillar-based 3D single object tracking framework. Firstly, we transform sparse point clouds into dense pillars to preserve the local and global geometrics. Secondly, we introduce a Pyramid-type Encoding Pillar Feature Encoder (PE-PFE) design to help the feature representation of each pillar. Thirdly, we present an efficient Transformer-based backbone from the perspective of modality differences. Finally, we construct our PillarTrack tracker based above designs. Extensive experiments on the KITTI and nuScenes dataset demonstrate the superiority of our proposed method. Notably, our method achieves state-of-the-art performance on the KITTI and nuScenes dataset and enables real-time tracking speed. We hope our work could encourage the community to rethink existing 3D SOT tracker designs.We will open source our code to the research community in https://github.com/StiphyJay/PillarTrack.

4/12/2024

cs.CV

PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram

Sifan Zhou, Zhihang Yuan, Dawei Yang, Xubin Wen, Xing Hu, Yuguang Shi, Ziyu Zhao, Xiaobo Lu

Real-time and high-performance 3D object detection plays a critical role in autonomous driving and robotics. Recent pillar-based 3D object detectors have gained significant attention due to their compact representation and low computational overhead, making them suitable for onboard deployment and quantization. However, existing pillar-based detectors still suffer from information loss along height dimension and large numerical distribution difference during pillar feature encoding (PFE), which severely limits their performance and quantization potential. To address above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder named PillarHist. Specifically, PillarHist statistics the discrete distribution of points at different heights within one pillar. This simple yet effective design greatly preserves the information along the height dimension while significantly reducing the computation overhead of the PFE. Meanwhile, PillarHist also constrains the arithmetic distribution of PFE input to a stable range, making it quantization-friendly. Notably, PillarHist operates exclusively within the PFE stage to enhance performance, enabling seamless integration into existing pillar-based methods without introducing complex operations. Extensive experiments show the effectiveness of PillarHist in terms of both efficiency and performance.

5/30/2024

cs.CV cs.RO

✨

PV-SSD: A Multi-Modal Point Cloud Feature Fusion Method for Projection Features and Variable Receptive Field Voxel Features

Yongxin Shao, Aihong Tan, Zhetao Sun, Enhui Zheng, Tianhong Yan, Peng Liao

LiDAR-based 3D object detection and classification is crucial for autonomous driving. However, real-time inference from extremely sparse 3D data is a formidable challenge. To address this problem, a typical class of approaches transforms the point cloud cast into a regular data representation (voxels or projection maps). Then, it performs feature extraction with convolutional neural networks. However, such methods often result in a certain degree of information loss due to down-sampling or over-compression of feature information. This paper proposes a multi-modal point cloud feature fusion method for projection features and variable receptive field voxel features (PV-SSD) based on projection and variable voxelization to solve the information loss problem. We design a two-branch feature extraction structure with a 2D convolutional neural network to extract the point cloud's projection features in bird's-eye view to focus on the correlation between local features. A voxel feature extraction branch is used to extract local fine-grained features. Meanwhile, we propose a voxel feature extraction method with variable sensory fields to reduce the information loss of voxel branches due to downsampling. It avoids missing critical point information by selecting more useful feature points based on feature point weights for the detection task. In addition, we propose a multi-modal feature fusion module for point clouds. To validate the effectiveness of our method, we tested it on the KITTI dataset and ONCE dataset.

4/9/2024

cs.CV