MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering

Read original: arXiv:2408.11464 - Published 8/22/2024 by Yonglin Tian, Songlin Bai, Zhiyao Luo, Yutong Wang, Yisheng Lv, Fei-Yue Wang

Overview

Proposes MambaOcc, a Mamba-based (visual state space model) method for occupancy prediction that operates on bird's-eye-view (BEV) features
Introduces a local adaptive reordering (LAR) mechanism, built on deformable convolution, to address Mamba's sensitivity to sequence order
Evaluates the approach on the Occ3D-nuScenes dataset and reports state-of-the-art accuracy with 42% fewer parameters and 39% lower computational cost than FlashOcc

Plain English Explanation

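The paper presents a new model called MambaOcc that aims to make occupancy prediction both accurate and cheap to compute. Occupancy prediction describes which parts of the space around a vehicle are filled, in terms of both geometry and semantics. MambaOcc works on bird's-eye-view (BEV) features, a top-down representation of the scene that is lighter than a dense 3D voxel grid and is widely used in autonomous driving.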

The key idea behind MambaOcc is to use a visual state space model in the style of Mamba. The model reads the BEV features as a sequence and carries a hidden state along it, so information from distant parts of the scene can influence each position. Transformer attention has a cost that grows quadratically with the number of positions, whereas the Mamba-style operation grows linearly, which is what makes long-range perception efficient here.
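To make the state space idea concrete, here is a minimal sketch of a linear recurrence scanned over a flattened BEV grid. It is an illustration under stated assumptions, not the authors' implementation: the function names `ssm_scan` and `flatten_row_major`, the scalar parameters `a`, `b`, `c`, and the tiny 2x2 grid are all hypothetical. A real Mamba layer uses learned, input-dependent parameters over many channels.

```python
from typing import List


def flatten_row_major(grid: List[List[float]]) -> List[float]:
    """Unroll a 2D grid into a 1D sequence, row by row."""
    return [value for row in grid for value in row]


def ssm_scan(inputs: List[float], a: float = 0.9, b: float = 1.0,
             c: float = 1.0) -> List[float]:
    """Scalar discrete state space recurrence (hypothetical fixed parameters).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The loop visits each token once, so the cost is linear in sequence length.
    """
    hidden = 0.0
    outputs = []
    for x in inputs:
        hidden = a * hidden + b * x
        outputs.append(c * hidden)
    return outputs


# Hypothetical 2x2 BEV feature map with a single channel.
bev = [[1.0, 0.0],
       [0.0, 2.0]]
tokens = flatten_row_major(bev)
outputs = ssm_scan(tokens, a=0.5)
print(outputs)  # [1.0, 0.5, 0.25, 2.125]
```

The last output still carries a trace of the first token (0.125 of the 2.125), which is the sense in which the hidden state passes information along the whole sequence.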

To further enhance the model's performance, the authors introduce a local adaptive reordering (LAR) mechanism. Mamba is sensitive to the order in which positions appear in its input sequence, and LAR uses deformable convolution to address that sensitivity. The authors also design a hybrid BEV encoder that combines convolution layers with Mamba.

The researchers evaluate MambaOcc on the Occ3D-nuScenes dataset and report state-of-the-art performance in both accuracy and computational efficiency. For example, compared to FlashOcc, MambaOcc delivers superior results while using 42% fewer parameters and 39% less computation.

Technical Explanation

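The paper proposes MambaOcc, a Mamba-based method for occupancy prediction. It targets the high computation costs and heavy parameter counts of existing works that rely on voxel-based dense 3D representations and Transformer-based quadratic attention.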

The key components of MambaOcc are:

Visual State Space Model: MambaOcc adopts BEV features to ease the burden of 3D scene representation and uses linear Mamba-style attention to achieve efficient long-range perception.
Local Adaptive Reordering (LAR): Mamba is sensitive to sequence order, so the authors propose a reordering mechanism based on deformable convolution.
Hybrid BEV Encoder: The BEV encoder combines convolution layers with Mamba.
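The order sensitivity that motivates local adaptive reordering can be illustrated with a small sketch. It is a simplified stand-in, not the paper's LAR module: the helper names `reorder_locally` and `causal_scan`, the integer offsets, and the toy grid are all hypothetical. Deformable convolution in practice predicts fractional offsets and samples with bilinear interpolation, while this sketch uses fixed integer offsets to keep the effect easy to follow.

```python
from typing import List, Tuple

Grid = List[List[float]]
Offsets = List[List[Tuple[int, int]]]


def reorder_locally(grid: Grid, offsets: Offsets) -> Grid:
    """Resample each cell from a nearby cell given by a per-cell (dy, dx) offset.

    Offsets are fixed integers here (assumption); a learned module would
    predict them from the features. Out-of-range targets are clamped.
    """
    height, width = len(grid), len(grid[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = offsets[y][x]
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            row.append(grid[src_y][src_x])
        result.append(row)
    return result


def flatten_row_major(grid: Grid) -> List[float]:
    """Unroll a 2D grid into a 1D sequence, row by row."""
    return [value for row in grid for value in row]


def causal_scan(tokens: List[float], decay: float = 0.5) -> List[float]:
    """Order-dependent recurrence: h_t = decay * h_{t-1} + x_t."""
    hidden = 0.0
    outputs = []
    for x in tokens:
        hidden = decay * hidden + x
        outputs.append(hidden)
    return outputs


# Hypothetical 2x3 BEV feature map with a single channel.
bev = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
# Swap the first two cells of the top row; every other cell stays in place.
offsets = [[(0, 1), (0, -1), (0, 0)],
           [(0, 0), (0, 0), (0, 0)]]

reordered = reorder_locally(bev, offsets)
plain_out = causal_scan(flatten_row_major(bev))
adapted_out = causal_scan(flatten_row_major(reordered))
print(reordered)    # [[2.0, 1.0, 3.0], [4.0, 5.0, 6.0]]
print(plain_out)    # [1.0, 2.5, 4.25, 6.125, 8.0625, 10.03125]
print(adapted_out)  # [2.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Swapping only two neighbouring cells changes every later output of the scan, which is why the order in which BEV positions are fed to a Mamba-style layer matters.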

The paper evaluates MambaOcc on the Occ3D-nuScenes dataset and reports state-of-the-art performance in both accuracy and computational efficiency. Compared to FlashOcc, it delivers superior results while reducing the number of parameters by 42% and computational costs by 39%.

Critical Analysis

The paper describes the MambaOcc model and its key components. However, several potential limitations and areas for future research remain:

Computational Complexity: The local adaptive reordering mechanism adds deformable convolution on top of the Mamba layers, which carries its own cost. The reported totals are still lower than FlashOcc (42% fewer parameters, 39% lower computational cost), but suitability for real-time applications would need to be confirmed on target hardware.
Generalization Ability: The reported experiments use the Occ3D-nuScenes dataset. It would be valuable to understand how the model performs on a wider range of scenarios, especially in more complex or diverse environments.
Interpretability: State space models can be difficult to interpret, as the hidden variables may not have a clear physical meaning. The paper could have discussed ways to improve the interpretability of the model's predictions.
Sensor Fusion: The paper focuses solely on BEV data, but in many real-world applications, occupancy prediction may benefit from fusing information from multiple sensors, such as cameras, LiDAR, or radar. Exploring how MambaOcc can be extended to leverage multimodal data could be an interesting direction for future research.
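On the computational complexity point above, a rough back-of-the-envelope count shows why a linear scan is attractive compared with quadratic attention. This is only an operation-count sketch: the 200x200 BEV grid is a hypothetical size chosen for illustration, the function names are made up, and constant factors, channel dimensions, and the cost of deformable convolution are all ignored.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Full self-attention compares every token with every token."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """A recurrent scan visits each token once."""
    return num_tokens


# Hypothetical BEV grid resolution, for illustration only.
height, width = 200, 200
num_tokens = height * width

pairs = attention_pair_count(num_tokens)
steps = scan_step_count(num_tokens)
ratio = pairs // steps

print(num_tokens)  # 40000
print(pairs)       # 1600000000
print(steps)       # 40000
print(ratio)       # 40000
```

The gap between the two counts equals the number of tokens, so it widens as the BEV grid gets finer.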

Conclusion

The MambaOcc model presented in this paper applies Mamba-style state space modeling to occupancy prediction from bird's-eye-view features. By combining it with local adaptive reordering and a hybrid convolution and Mamba BEV encoder, the model reports state-of-the-art accuracy on Occ3D-nuScenes with fewer parameters and lower computational cost than FlashOcc.

This work highlights the potential of state space models for enhancing perception in applications like self-driving cars, robotics, and smart city infrastructure. Research that addresses the identified limitations and explores multimodal sensor fusion could strengthen the practical impact of this approach.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering

Yonglin Tian, Songlin Bai, Zhiyao Luo, Yutong Wang, Yisheng Lv, Fei-Yue Wang

Occupancy prediction has attracted intensive attention and shown great superiority in the development of autonomous driving systems. The fine-grained environmental representation brought by occupancy prediction in terms of both geometry and semantic information has facilitated the general perception and safe planning under open scenarios. However, it also brings high computation costs and heavy parameters in existing works that utilize voxel-based 3d dense representation and Transformer-based quadratic attention. To address these challenges, in this paper, we propose a Mamba-based occupancy prediction method (MambaOcc) adopting BEV features to ease the burden of 3D scenario representation, and linear Mamba-style attention to achieve efficient long-range perception. Besides, to address the sensitivity of Mamba to sequence order, we propose a local adaptive reordering (LAR) mechanism with deformable convolution and design a hybrid BEV encoder comprised of convolution layers and Mamba. Extensive experiments on the Occ3D-nuScenes dataset demonstrate that MambaOcc achieves state-of-the-art performance in terms of both accuracy and computational efficiency. For example, compared to FlashOcc, MambaOcc delivers superior results while reducing the number of parameters by 42% and computational costs by 39%. Code will be available at https://github.com/Hub-Tian/MambaOcc.

8/22/2024

OccMamba: Semantic Occupancy Prediction with State Space Models

Heng Li, Yuenan Hou, Xiaohan Xing, Xiao Sun, Yanyong Zhang

Training deep learning models for semantic occupancy prediction is challenging due to factors such as a large number of occupancy cells, severe occlusion, limited visual cues, complicated driving scenarios, etc. Recent methods often adopt transformer-based architectures given their strong capability in learning input-conditioned weights and long-range relationships. However, transformer-based networks are notorious for their quadratic computation complexity, seriously undermining their efficacy and deployment in semantic occupancy prediction. Inspired by the global modeling and linear computation complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. However, directly applying the Mamba architecture to the occupancy prediction task yields unsatisfactory performance due to the inherent domain gap between the linguistic and 3D domains. To relieve this problem, we present a simple yet effective 3D-to-1D reordering operation, i.e., height-prioritized 2D Hilbert expansion. It can maximally retain the spatial structure of point clouds as well as facilitate the processing of Mamba blocks. Our OccMamba achieves state-of-the-art performance on three prevalent occupancy prediction benchmarks, including OpenOccupancy, SemanticKITTI and SemanticPOSS. Notably, on OpenOccupancy, our OccMamba outperforms the previous state-of-the-art Co-Occ by 3.1% IoU and 3.2% mIoU, respectively. Codes will be released upon publication.

8/20/2024

Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model

Xu Han, Yuan Tang, Zhaoxuan Wang, Xianzhi Li

Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D.

9/4/2024

MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Xiao Wang, Chao Wang, Shiao Wang, Xixi Wang, Zhicheng Zhao, Lin Zhu, Bo Jiang

Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released on https://github.com/Event-AHU/MambaEVT

8/21/2024