SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation

2404.02638

Published 4/4/2024 by Junyan Ye, Qiyan Luo, Jinhua Yu, Huaping Zhong, Zhimeng Zheng, Conghui He, Weijia Li

Abstract

This paper aims at achieving fine-grained building attribute segmentation in a cross-view scenario, i.e., using satellite and street-view image pairs. The main challenge lies in overcoming the significant perspective differences between street views and satellite views. In this work, we introduce SG-BEV, a novel approach for satellite-guided BEV fusion for cross-view semantic segmentation. To overcome the limitations of existing cross-view projection methods in capturing the complete building facade features, we incorporate a Bird's Eye View (BEV) method to establish a spatially explicit mapping of street-view features. Moreover, we fully leverage the advantages of multiple perspectives by introducing a novel satellite-guided reprojection module, which mitigates the uneven feature distribution associated with traditional BEV methods. Our method demonstrates significant improvements on four cross-view datasets collected from multiple cities, including New York, San Francisco, and Boston. On average across these datasets, our method achieves an increase in mIoU of 10.13% and 5.21% compared with the state-of-the-art satellite-based and cross-view methods, respectively. The code and datasets of this work will be released at https://github.com/yejy53/SG-BEV.

Overview

This paper proposes SG-BEV (Satellite-Guided BEV Fusion), an approach to cross-view semantic segmentation that segments fine-grained building attributes from paired satellite and street-view images.
The key idea is to map street-view features into a Bird's Eye View (BEV) representation and fuse them with satellite features, so that facade details seen from the street inform the top-down prediction.
On four cross-view datasets collected from cities including New York, San Francisco, and Boston, the authors report average mIoU gains of 10.13% and 5.21% over state-of-the-art satellite-based and cross-view methods, respectively.

Plain English Explanation

Imagine you're trying to label every building in a satellite image with a fine-grained attribute. This task, a form of semantic segmentation, is hard to do from above alone, because a satellite mostly sees rooftops while many of the cues that distinguish one building from another are on its facade.

Now, what if you could also use street-view photos, which show those facades up close? The researchers in this paper hypothesized that combining the two perspectives, the street-level view and the satellite's overhead view, could lead to better building segmentation. The difficulty is that the two views look very different, so their information has to be spatially aligned before it can be combined.

Their approach, called SG-BEV, maps the street-view features onto a top-down (bird's-eye-view) grid so that they line up with the satellite image, and then fuses the two. A satellite-guided reprojection module addresses the uneven distribution of features that traditional BEV methods produce. The intuition is that the satellite view provides the overall layout of the scene, while the street view captures fine-grained details of the building facades.

By integrating these complementary sources of information, the researchers report improved performance on four cross-view datasets, compared with both satellite-only methods and earlier cross-view methods. This suggests that their satellite-guided fusion approach is a promising direction for cross-view semantic segmentation.

Technical Explanation

The key components of the SG-BEV approach are:

Street-View Perception: A deep network processes the street-view images and extracts features relevant for semantic segmentation.
Satellite Perception: A separate network processes the satellite imagery and extracts corresponding features.
BEV Projection and Satellite-Guided Reprojection: The street-view features are mapped into a spatially explicit BEV representation, and a satellite-guided reprojection module mitigates the uneven feature distribution that traditional BEV methods produce.
Segmentation Head: The fused satellite and BEV features are passed through a segmentation head, which outputs the final per-pixel building attribute labels.
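
The BEV projection and fusion steps can be illustrated with a toy sketch. This is not the authors' implementation: the function names `project_to_bev` and `fuse_with_satellite`, the input format (an azimuth, a depth, and a feature vector per street-view sample), the averaging within a cell, and the fusion by concatenation are all assumptions made for illustration. The returned per-cell counts show why BEV features tend to be unevenly distributed: cells that receive many samples are well described, while others stay empty.

```python
import math


def project_to_bev(samples, grid_size=8, cell_m=2.0):
    """Scatter street-view features onto a top-down grid centred on the camera.

    samples: iterable of (azimuth_rad, depth_m, feature_vector), where azimuth
    is measured clockwise from north. This input format is an assumption.
    Returns (bev, counts): mean feature and sample count per occupied cell.
    """
    half = grid_size // 2
    sums, counts = {}, {}
    for azimuth, depth, feat in samples:
        # Polar (azimuth, depth) -> metric offsets east (x) and north (y).
        x = depth * math.sin(azimuth)
        y = depth * math.cos(azimuth)
        col = half + math.floor(x / cell_m)
        row = half - 1 - math.floor(y / cell_m)  # row 0 is the northern edge
        if not (0 <= row < grid_size and 0 <= col < grid_size):
            continue  # sample falls outside the BEV extent
        cell = (row, col)
        if cell not in sums:
            sums[cell] = [0.0] * len(feat)
            counts[cell] = 0
        sums[cell] = [s + f for s, f in zip(sums[cell], feat)]
        counts[cell] += 1
    bev = {c: [s / counts[c] for s in total] for c, total in sums.items()}
    return bev, counts


def fuse_with_satellite(bev, satellite, bev_dim):
    """Concatenate BEV and satellite features cell by cell.

    satellite: grid (list of rows) of feature vectors aligned with the BEV grid.
    Cells with no street-view samples get a zero BEV feature (an assumption).
    """
    empty = [0.0] * bev_dim
    return [
        [list(bev.get((r, c), empty)) + list(sat_feat)
         for c, sat_feat in enumerate(sat_row)]
        for r, sat_row in enumerate(satellite)
    ]
```

In this sketch most grid cells remain empty after projection, which is the kind of sparsity that a reprojection step guided by the dense satellite view is meant to compensate for.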

The authors evaluate SG-BEV on four cross-view datasets collected from multiple cities, including New York, San Francisco, and Boston. Averaged across these datasets, they report mIoU gains of 10.13% over state-of-the-art satellite-based methods and 5.21% over state-of-the-art cross-view methods.
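
The abstract reports results in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch of one common way to compute it from a predicted and a ground-truth label map. The function name `mean_iou` and the choice to skip classes that appear in neither map are assumptions; the paper's exact evaluation protocol may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two label maps of equal shape.

    pred, target: lists of rows of integer class ids in [0, num_classes).
    Classes absent from both maps are skipped (an assumption of this sketch).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                # Pixel is in both the predicted and true region of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel counts toward the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [[0, 0], [1, 1]]` and `target = [[0, 1], [1, 1]]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.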

Critical Analysis

The authors acknowledge several limitations of their work. First, the performance of SG-BEV is still constrained by the quality and resolution of the available satellite imagery, which may not always be sufficient for accurate object identification. Additionally, the fusion mechanism, while effective, could potentially be further improved by exploring alternative architectures or attention mechanisms.

Another aspect that could be investigated is the robustness of SG-BEV to variations in the input data, such as changes in weather, lighting conditions, or camera viewpoints. Understanding the limitations and edge cases of the approach is important for its practical deployment in real-world applications.

Finally, the paper does not provide a comprehensive analysis of the computational costs and inference times of SG-BEV, which would be valuable information for deploying the system in resource-constrained environments, such as autonomous vehicles or mobile robots.

Conclusion

Overall, the SG-BEV approach proposed in this paper represents a promising step towards improving cross-view semantic segmentation by leveraging the complementary strengths of street-view and satellite imagery. The authors demonstrate the effectiveness of their fusion-based method for building attribute segmentation on four cross-view datasets, suggesting that it could have applications in areas like urban planning and disaster response.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Bird's Eye View Semantic Segmentation by Task Decomposition

Tianhao Zhao, Yongcan Chen, Yu Wu, Tianyang Liu, Bo Du, Peilun Xiao, Shi Qiu, Hongda Yang, Guozhen Li, Yi Yang, Yutian Lin

Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, the challenge arises when the RGB inputs and BEV targets are from distinct perspectives, making the direct point-to-point prediction hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage, we train a BEV autoencoder to reconstruct the BEV segmentation maps given corrupted noisy latent representation, which urges the decoder to learn fundamental knowledge of typical BEV patterns. The second stage involves mapping RGB input images into the BEV latent space of the first stage, directly optimizing the correlations between the two views at the feature level. Our approach simplifies the complexity of combining perception and generation into distinct steps, equipping the model to handle intricate and challenging scenes effectively. Besides, we propose to transform the BEV segmentation map from the Cartesian to the polar coordinate system to establish the column-wise correspondence between RGB images and BEV maps. Moreover, our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation and saves computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at https://github.com/happytianhao/TaDe.

4/3/2024

cs.CV cs.AI

LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping

Nikhil Gosala, Kursat Petek, B Ravi Kiran, Senthil Yogamani, Paulo Drews-Jr, Wolfram Burgard, Abhinav Valada

Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data.

5/30/2024

cs.CV cs.AI cs.RO

DaF-BEVSeg: Distortion-aware Fisheye Camera based Bird's Eye View Segmentation with Occlusion Reasoning

Senthil Yogamani, David Unger, Venkatraman Narayanan, Varun Ravi Kumar

Semantic segmentation is an effective way to perform scene understanding. Recently, segmentation in 3D Bird's Eye View (BEV) space has become popular as it is directly used by the drive policy. However, there is limited work on BEV segmentation for surround-view fisheye cameras, commonly used in commercial vehicles. As this task has no real-world public dataset and existing synthetic datasets do not handle amodal regions due to occlusion, we create a synthetic dataset using the Cognata simulator comprising diverse road types, weather, and lighting conditions. We generalize the BEV segmentation to work with any camera model; this is useful for mixing diverse cameras. We implement a baseline by applying cylindrical rectification on the fisheye images and using a standard LSS-based BEV segmentation model. We demonstrate that we can achieve better performance without undistortion, which has the adverse effects of increased runtime due to pre-processing, reduced field-of-view, and resampling artifacts. Further, we introduce a distortion-aware learnable BEV pooling strategy that is more effective for the fisheye cameras. We extend the model with an occlusion reasoning module, which is critical for estimating in BEV space. Qualitative performance of DaF-BEVSeg is showcased in the video at https://streamable.com/ge4v51.

4/10/2024

cs.CV cs.RO

GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, Li Wang

Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to the inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called Graph BEV. Addressing errors caused by inaccurate point cloud projection, we introduce a Local Align module that employs neighbor-aware depth features via Graph matching. Additionally, we propose a Global Align module to rectify the misalignment between LiDAR and camera BEV features. Our Graph BEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEV Fusion by 1.6% on the nuScenes validation set. Importantly, our Graph BEV outperforms BEV Fusion by 8.3% under conditions with misalignment noise.

4/11/2024

cs.CV