Mask4Former: Mask Transformer for 4D Panoptic Segmentation

Read original: arXiv:2309.16133 - Published 4/12/2024 by Kadir Yilmaz, Jonas Schult, Alexey Nekrasov, Bastian Leibe

🤿

Overview

This paper proposes a novel transformer-based model called Mask4Former for the task of 4D panoptic segmentation of LiDAR point clouds.
4D panoptic segmentation involves simultaneously identifying and tracking semantic instances over time in sparse, irregular 3D point cloud data.
Mask4Former is the first transformer-based approach that unifies semantic instance segmentation and tracking into a single joint model, without relying on hand-crafted association strategies.

Plain English Explanation

Accurately perceiving and tracking objects over time is crucial for autonomous agents navigating complex, dynamic environments safely. Mask4Former is a new AI model that tackles the challenging problem of 4D panoptic segmentation - simultaneously identifying different semantic objects (like cars, pedestrians, etc.) in 3D point cloud data and tracking those objects over time.

Unlike previous approaches that used separate models or hand-crafted methods to do segmentation and tracking, Mask4Former is a single, unified transformer-based model that can do both tasks together. It uses special "spatio-temporal instance queries" that encode the semantic and geometric properties of each object being tracked, allowing it to directly predict the object instances and how they change over time.

A key insight is that promoting compact, spatially cohesive object predictions is critical, as the spatio-temporal queries can otherwise merge nearby objects of the same semantic class. To address this, Mask4Former regresses 3D bounding box parameters as an auxiliary task, which helps the model keep object instances separate even if they are close together.

Technical Explanation

Mask4Former builds on recent advances in transformer-based segmentation models like Mixed-Query Transformer and Sparse-Laneformer, unifying semantic instance segmentation and temporal tracking into a single framework.

The core of Mask4Former is the spatio-temporal instance query, which encodes both the semantic class and geometric properties of each tracked object. These queries are used to directly predict the instance masks and their associations over time, without relying on separate clustering or voting-based tracking modules.

To encourage spatially compact instance predictions, Mask4Former also regresses 6-DOF bounding box parameters from the instance queries as an auxiliary task. This helps prevent the model from merging nearby instances of the same semantic class, which the authors found to be a common failure mode.

Mask4Former is evaluated on the challenging SemanticKITTI benchmark for 4D panoptic segmentation of LiDAR data, where it achieves a new state-of-the-art score of 68.4 LSTQ.

Critical Analysis

The authors provide a thorough analysis of Mask4Former's strengths and limitations. A key insight is the importance of promoting spatially compact instance predictions, which the authors find to be crucial for accurate 4D panoptic segmentation.

That said, the paper does not explore the model's generalization to other datasets or sensor modalities beyond LiDAR point clouds. Further research would be needed to assess how well the approach transfers to other 3D perception tasks and environments.

Additionally, the computational and memory efficiency of the transformer-based architecture is not examined in detail. As transformer models can be resource-intensive, this could be an important practical consideration for deploying Mask4Former in real-world autonomous systems.

Overall, Mask4Former represents an impressive step forward in unified 3D instance segmentation and tracking. However, the research community should continue exploring ways to make such models more robust, efficient, and generalizable to a wider range of sensing modalities and application domains.

Conclusion

Mask4Former is a novel transformer-based approach that can jointly perform 4D panoptic segmentation of LiDAR point clouds, identifying and tracking semantic object instances over time in a unified manner. By introducing spatio-temporal instance queries and an auxiliary bounding box regression task, the model achieves state-of-the-art performance on the challenging SemanticKITTI benchmark.

This work highlights the potential of transformer architectures to tackle complex 3D perception tasks that require both semantic understanding and spatial-temporal reasoning. As autonomous systems become more advanced, models like Mask4Former will be crucial for enabling safe and robust navigation in dynamic environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🤿

Mask4Former: Mask Transformer for 4D Panoptic Segmentation

Kadir Yilmaz, Jonas Schult, Alexey Nekrasov, Bastian Leibe

Accurately perceiving and tracking instances over time is essential for the decision-making processes of autonomous agents interacting safely in dynamic environments. With this intention, we propose Mask4Former for the challenging task of 4D panoptic segmentation of LiDAR point clouds. Mask4Former is the first transformer-based approach unifying semantic instance segmentation and tracking of sparse and irregular sequences of 3D point clouds into a single joint model. Our model directly predicts semantic instances and their temporal associations without relying on hand-crafted non-learned association strategies such as probabilistic clustering or voting-based center prediction. Instead, Mask4Former introduces spatio-temporal instance queries that encode the semantic and geometric properties of each semantic tracklet in the sequence. In an in-depth study, we find that promoting spatially compact instance predictions is critical as spatio-temporal instance queries tend to merge multiple semantically similar instances, even if they are spatially distant. To this end, we regress 6-DOF bounding box parameters from spatio-temporal instance queries, which are used as an auxiliary task to foster spatially compact predictions. Mask4Former achieves a new state-of-the-art on the SemanticKITTI test set with a score of 68.4 LSTQ.

4/12/2024

New!VistaFormer: Scalable Vision Transformers for Satellite Image Time Series Segmentation

Ezra MacDonald, Derek Jacoby, Yvonne Coady

We introduce VistaFormer, a lightweight Transformer-based model architecture for the semantic segmentation of remote-sensing images. This model uses a multi-scale Transformer-based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position-free self-attention layers which simplifies the model architecture and removes the need to interpolate temporal and spatial codes, which can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi-Head Self-Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop-type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA while also using fewer trainable parameters. VistaFormer with MHSA improves on state-of-the-art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.

9/16/2024

ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers

Jinke Li, Xiao He, Chonghua Zhou, Xiaoqiang Cheng, Yang Wen, Dan Zhang

3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose our learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, including map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The codes are available at url{https://github.com/ViewFormerOcc/ViewFormer-Occ}.

7/15/2024

4D Panoptic Scene Graph Generation

Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu

We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.

5/17/2024