RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing

Read original: arXiv:2309.10356 - Published 7/2/2024 by Jiahang Li, Yikang Zhang, Peng Yun, Guangliang Zhou, Qijun Chen, Rui Fan

Overview

The paper proposes RoadFormer, a Transformer-based data-fusion network for semantic road scene parsing that combines RGB images with surface normal information.
RoadFormer exploits the complementary nature of RGB and surface normal data and is reported to outperform other state-of-the-art networks on the authors' SYN-UDTIRI dataset and on three public datasets: KITTI road, CityScapes, and ORFD.
The architecture uses a duplex encoder to extract features from each modality, a heterogeneous feature synergy block to fuse and recalibrate them, and a pixel decoder followed by a Transformer decoder to produce the final semantic prediction.

Plain English Explanation

The RoadFormer model is designed to help computers better understand and interpret images of road scenes. It combines two types of information, color (RGB) and surface normals (the direction each visible surface faces), to get a more complete understanding of the scene.

The key idea is that the RGB and normal data provide complementary information that, when used together, can lead to better performance in tasks like identifying the drivable area (freespace) and hazardous road defects. The duplex encoder processes the RGB and normal data separately, and a dedicated fusion block then combines the two sets of features, enabling the model to learn more comprehensive representations of the scene.

By leveraging both RGB and normal data, RoadFormer outperforms other state-of-the-art methods on several benchmark datasets for semantic road scene parsing. This means it is better able to accurately identify and label the various components of a road scene, which could be useful for applications like self-driving cars, traffic monitoring, and urban planning.

Technical Explanation

The RoadFormer model is built upon the Transformer architecture, which has shown great success in various computer vision tasks. The authors propose a duplex encoder architecture that extracts heterogeneous features from both RGB images and surface normal information.

The encoded features are fed into a heterogeneous feature synergy block, which fuses and recalibrates the features from the two modalities. A pixel decoder then learns multi-scale long-range dependencies from the fused features, and a Transformer decoder processes its output to produce the final semantic prediction. This pipeline lets the model combine the complementary RGB and normal data, leading to improved performance on semantic road scene parsing tasks.
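The RoadFormer+ abstract listed under Related Papers describes RoadFormer as fusing RGB and surface normal features through attention mechanisms. The sketch below illustrates that general idea with plain scaled dot-product cross-attention, in which each RGB token gathers information from the normal tokens. It is not the authors' fusion block: the function names, the residual-add fusion rule, and the toy token values are all assumptions made for illustration, and learned projection matrices are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all key/value tokens.

    queries: list of d-dimensional vectors (here, RGB tokens).
    keys, values: lists of vectors from the other modality (here, normal tokens).
    Learned query/key/value projections are left out to keep the sketch short.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    attended = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output coordinate at a time.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        attended.append(mixed)
    return attended


def fuse(rgb_tokens, normal_tokens):
    """Hypothetical fusion rule: add to each RGB token what it gathered from the normal tokens."""
    gathered = cross_attention(rgb_tokens, normal_tokens, normal_tokens)
    return [[r + g for r, g in zip(rgb, got)] for rgb, got in zip(rgb_tokens, gathered)]


# Toy example: two RGB tokens and three normal tokens, each of dimension 2.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
normal_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = fuse(rgb_tokens, normal_tokens)
```

The fused output keeps one token per RGB token, so it can be passed on to later stages with the same shape as the RGB features. A real network would run this per attention head on learned projections and would usually let information flow in both directions.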

The authors evaluate RoadFormer on their newly released SYN-UDTIRI dataset, which contains over 10,407 RGB images with dense depth images and pixel-level annotations for freespace and road defects, as well as on three public datasets: KITTI road, CityScapes, and ORFD. They report that RoadFormer outperforms all other state-of-the-art networks for road scene parsing and ranks first on the KITTI road benchmark. The authors also conduct ablation studies to demonstrate the importance of the duplex architecture and the synergy between RGB and normal data.
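Benchmark comparisons for scene parsing are commonly expressed as mean intersection over union (mIoU), the metric named in the RoadFormer+ abstract under Related Papers. This summary does not state which metrics RoadFormer itself reports, so the following is only a generic sketch of how mIoU is computed from predicted and ground-truth label maps. The three class ids (0 for background, 1 for freespace, 2 for road defect) and the tiny label maps are made up for illustration.

```python
def confusion_matrix(pred, target, num_classes):
    """Count pixels by (true class, predicted class) over two equally sized label maps."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            matrix[t][p] += 1
    return matrix


def per_class_iou(matrix):
    """IoU per class: TP / (TP + FP + FN). Classes absent from both maps yield None."""
    num_classes = len(matrix)
    ious = []
    for c in range(num_classes):
        true_pos = matrix[c][c]
        false_neg = sum(matrix[c]) - true_pos
        false_pos = sum(matrix[r][c] for r in range(num_classes)) - true_pos
        denom = true_pos + false_pos + false_neg
        ious.append(true_pos / denom if denom else None)
    return ious


def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


# Hypothetical 2x3 label maps: 0 = background, 1 = freespace, 2 = road defect.
target = [[0, 1, 1],
          [0, 1, 2]]
pred = [[0, 1, 1],
        [1, 1, 2]]

ious = per_class_iou(confusion_matrix(pred, target, num_classes=3))
miou = mean_iou(ious)  # per-class IoU is 0.5, 0.75, 1.0, so mIoU is 0.75
```

Averaging per class rather than per pixel keeps a small class such as road defects from being swamped by the much larger freespace and background regions.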

Critical Analysis

The paper presents a compelling approach to improving semantic road scene parsing by effectively fusing RGB and normal data using a Duplex Transformer architecture. The authors provide a thorough evaluation of their model on several benchmark datasets, demonstrating its superiority over state-of-the-art methods.

One potential limitation of the research is the reliance on normal data, which may not always be readily available in real-world scenarios. Surface normal maps are usually not captured directly; estimators such as the SNE module from the SNE-RoadSeg line of work, which includes SNE-RoadSegV2, derive them from dense depth or disparity images rather than from RGB alone, so a depth source is still required alongside the camera.
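To make concrete what obtaining normal information involves, the sketch below derives per-pixel surface normals from a dense depth image, the kind of data that SYN-UDTIRI provides. It uses the simplest textbook approach (back-project pixels with a pinhole camera model, then take the cross product of neighbouring point differences) and is not the estimator used by the authors. The function names and the camera intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth Z to a 3D point in the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normals_from_depth(depth_map, fx, fy, cx, cy):
    """Unit normals from forward differences of back-projected points.

    Returns an (H-1) x (W-1) grid, because the last row and column
    have no forward neighbour to difference against.
    """
    height, width = len(depth_map), len(depth_map[0])
    normals = []
    for v in range(height - 1):
        row = []
        for u in range(width - 1):
            p = backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
            right = backproject(u + 1, v, depth_map[v][u + 1], fx, fy, cx, cy)
            down = backproject(u, v + 1, depth_map[v + 1][u], fx, fy, cx, cy)
            du = tuple(r - q for r, q in zip(right, p))
            dv = tuple(d - q for d, q in zip(down, p))
            n = cross(du, dv)
            length = math.sqrt(sum(c * c for c in n))
            if length == 0.0:
                row.append((0.0, 0.0, 0.0))  # degenerate patch, no defined normal
                continue
            n = tuple(c / length for c in n)
            # Orient the normal towards the camera, which sits at the origin.
            if sum(nc * pc for nc, pc in zip(n, p)) > 0:
                n = tuple(-c for c in n)
            row.append(n)
        normals.append(row)
    return normals


# A flat wall 2 m in front of the camera; the intrinsics are made-up values.
flat_wall = [[2.0] * 4 for _ in range(3)]
normals = normals_from_depth(flat_wall, fx=100.0, fy=100.0, cx=1.5, cy=1.0)
```

For the flat wall every normal points straight back at the camera along the negative z axis. Real depth images are noisy, so practical estimators smooth or filter the differences before taking the cross product.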

Additionally, the paper could have delved deeper into the interpretability of the Duplex Transformer's internal workings and how the cross-attention mechanism facilitates the fusion of RGB and normal data. Further analysis of the model's performance on specific road scene elements or in challenging scenarios could also provide valuable insights.

Conclusion

The RoadFormer model presented in this paper demonstrates the potential of leveraging complementary modalities, in this case RGB and normal data, to improve semantic road scene parsing. By introducing a duplex encoder and a dedicated fusion block that combine these two data sources, the authors have developed an approach that is reported to outperform other state-of-the-art methods on several benchmark datasets.

The successful integration of RGB and normal information highlights the importance of exploring multimodal approaches in computer vision, as they can lead to more comprehensive and accurate scene understanding. The insights from this research could have significant implications for various applications, such as autonomous driving, urban planning, and traffic monitoring, where robust and reliable road scene parsing is crucial.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing

Jiahang Li, Yikang Zhang, Peng Yun, Guangliang Zhou, Qijun Chen, Rui Fan

The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, the existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that contains over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source code, created dataset, and demo video are publicly available at mias.group/RoadFormer.

7/2/2024

RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion

Jianxin Huang, Jiahang Li, Ning Jia, Yuxiang Sun, Chengju Liu, Qijun Chen, Rui Fan

Task-specific data-fusion networks have marked considerable achievements in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance significantly deteriorates when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where "X" represents additional types/modalities of data such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attentions and convolutional neural network modules to merge multi-scale features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art performance in mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65% compared to RoadFormer. Our source code will be publicly available at mias.group/RoadFormerPlus.

8/23/2024

SemanticFormer: Holistic and Semantic Traffic Scene Representation for Trajectory Prediction using Knowledge Graphs

Zhigang Sun, Zixu Wang, Lavdim Halilaj, Juergen Luettin

Trajectory prediction in autonomous driving relies on accurate representation of all relevant contexts of the driving scene, including traffic participants, road topology, traffic signs, as well as their semantic relations to each other. Despite increased attention to this issue, most approaches in trajectory prediction do not consider all of these factors sufficiently. We present SemanticFormer, an approach for predicting multimodal trajectories by reasoning over a semantic traffic scene graph using a hybrid approach. It utilizes high-level information in the form of meta-paths, i.e. trajectories on which an agent is allowed to drive from a knowledge graph which is then processed by a novel pipeline based on multiple attention mechanisms to predict accurate trajectories. SemanticFormer comprises a hierarchical heterogeneous graph encoder to capture spatio-temporal and relational information across agents as well as between agents and road elements. Further, it includes a predictor to fuse different encodings and decode trajectories with probabilities. Finally, a refinement module assesses permitted meta-paths of trajectories and speed profiles to obtain final predicted trajectories. Evaluation of the nuScenes benchmark demonstrates improved performance compared to several SOTA methods. In addition, we demonstrate that our knowledge graph can be easily added to two graph-based existing SOTA methods, namely VectorNet and Laformer, replacing their original homogeneous graphs. The evaluation results suggest that by adding our knowledge graph the performance of the original methods is enhanced by 5% and 4%, respectively.

7/2/2024

Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes

Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai

RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features for meeting real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.

9/14/2024