DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception

2305.03724

Published 6/13/2024 by Yunze Man, Liang-Yan Gui, Yu-Xiong Wang


Abstract

Closing the domain gap between training and deployment and incorporating multiple sensor modalities are two challenging yet critical topics for self-driving. Existing work focuses on only one of these topics, overlooking the simultaneous domain and modality shift that pervasively exists in real-world scenarios. A model trained with multi-sensor data collected in Europe may need to run in Asia with a subset of input sensors available. In this work, we propose DualCross, a cross-modality cross-domain adaptation framework to facilitate the learning of a more robust monocular bird's-eye-view (BEV) perception model, which transfers the point cloud knowledge from a LiDAR sensor in one domain during the training phase to the camera-only testing scenario in a different domain. This work results in the first open analysis of cross-domain cross-sensor perception and adaptation for monocular 3D tasks in the wild. We benchmark our approach on large-scale datasets under a wide range of domain shifts and show state-of-the-art results against various baselines.


Overview

Closing the domain gap between training and deployment is a critical challenge for self-driving car systems
Incorporating multiple sensor modalities (e.g., camera, lidar) is another key issue for robust perception
Existing work has focused on these challenges separately, overlooking the real-world scenario where both domain and modality shifts exist
This paper proposes "DualCross", a framework to adapt a monocular bird's-eye-view (BEV) perception model across domains and sensor modalities

Plain English Explanation

Self-driving car systems need to work reliably in the real world, which can be very different from the conditions they were trained on. For example, a model trained on data collected in Europe may need to operate in Asia, with a different environment and potentially fewer sensors available, like only a camera instead of the full suite of sensors used during training.

The DualCross framework proposed in this paper aims to address this challenge. It can take a model trained on data from one domain (e.g., Europe) and sensor setup (e.g., camera + lidar), and adapt it to work in a different domain (e.g., Asia) and with a reduced sensor set (e.g., just a camera). This allows the model to leverage the 3D information from the lidar sensor during training, while still being able to run effectively with only a monocular camera at deployment.

The key idea is to transfer the 3D perception knowledge learned from the lidar sensor to the monocular camera, even when the environments are quite different. This makes the model more robust to the shifts in both domain and sensor modality that can occur in real-world self-driving scenarios.

Technical Explanation

The DualCross framework consists of two main components:

Cross-Modality Adaptation: This module transfers the 3D perception knowledge learned from the lidar sensor to the monocular camera. It does this by aligning the features from the two sensor modalities, leveraging techniques like contrastive alignment to bridge the gap between their feature representations.
Cross-Domain Adaptation: This component adapts the model to operate effectively in the target domain, which may have significantly different characteristics than the source domain used for training. It uses techniques like adversarial domain adaptation to align the feature distributions across domains.
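
The two components above can be illustrated with a pair of toy loss functions. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, features are plain Python lists, and a mean-squared-error term stands in for the contrastive alignment mentioned above because it is the simplest way to show camera features being pulled toward LiDAR features. The adversarial part uses one common formulation, in which the feature extractor minimizes the negative of the domain discriminator's loss.

```python
import math


def feature_alignment_loss(camera_feats, lidar_feats):
    """Mean squared error between camera and LiDAR feature vectors.

    Hypothetical stand-in for a cross-modality alignment loss.
    """
    if len(camera_feats) != len(lidar_feats):
        raise ValueError("feature vectors must have the same length")
    squared = [(c - l) ** 2 for c, l in zip(camera_feats, lidar_feats)]
    return sum(squared) / len(squared)


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def domain_discriminator_loss(logit, is_source):
    """Binary cross-entropy for a domain classifier on one sample.

    The discriminator outputs a logit; a high value means "source domain".
    """
    eps = 1e-12  # avoids log(0)
    p_source = _sigmoid(logit)
    if is_source:
        return -math.log(p_source + eps)
    return -math.log(1.0 - p_source + eps)


def feature_extractor_adversarial_loss(logit, is_source):
    """Loss seen by the feature extractor in adversarial training.

    The extractor tries to fool the discriminator, so it minimizes the
    negative of the discriminator's loss (assumed formulation).
    """
    return -domain_discriminator_loss(logit, is_source)


if __name__ == "__main__":
    # Toy camera (student) and LiDAR (teacher) features for one BEV cell.
    print(feature_alignment_loss([0.2, 0.4], [0.3, 0.1]))
    # A discriminator that is unsure about a target-domain sample.
    print(domain_discriminator_loss(0.0, is_source=False))
```

In this toy setting, the alignment loss falls to zero when the camera features match the LiDAR features exactly, and the discriminator loss is lowest when the domain label is predicted confidently and correctly.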

By combining these two adaptation mechanisms, DualCross can learn a monocular BEV perception model that is robust to both domain shifts and changes in sensor modality. The authors benchmark their approach on large-scale datasets and show state-of-the-art results compared to various baselines.
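
One generic way to combine two adaptation mechanisms with a perception task is a weighted sum of losses. The expression below is a sketch under that assumption and is not taken from the paper: the symbols $\mathcal{L}_{\text{task}}$ (the BEV perception loss), $\mathcal{L}_{\text{mod}}$ (the cross-modality term), $\mathcal{L}_{\text{dom}}$ (the cross-domain term), and the weights $\lambda_{\text{mod}}, \lambda_{\text{dom}} \ge 0$ are all introduced here for illustration.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{task}}
  + \lambda_{\text{mod}} \, \mathcal{L}_{\text{mod}}
  + \lambda_{\text{dom}} \, \mathcal{L}_{\text{dom}}
```

Setting either weight to zero recovers a model that handles only one of the two shifts, which is the single-topic setting the abstract contrasts against.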

Critical Analysis

The paper presents a comprehensive solution to an important real-world problem in self-driving car perception. However, there are a few potential limitations and areas for further research:

The experiments are conducted on relatively high-level perception tasks like object detection. It would be interesting to see how the approach scales to more fine-grained tasks like instance segmentation or tracking.
Although the authors report results under a wide range of domain shifts, the approach adapts from one source domain to one target domain (e.g., Europe to Asia). Further investigation is needed to understand how well it generalizes to domains not seen during adaptation.
The cross-modality adaptation relies on having access to paired sensor data (e.g., camera and lidar) in the source domain. It is unclear how the method would perform if such paired data were not available.

Overall, the DualCross framework represents a significant step forward in making self-driving systems more robust and deployable in the real world. By jointly addressing domain and modality shifts, it helps bridge the gap between the controlled conditions of the lab and the unpredictability of the open road.

Conclusion

This paper introduces the DualCross framework, which tackles two critical challenges in self-driving car perception: closing the domain gap between training and deployment, and incorporating multiple sensor modalities. By combining cross-modality and cross-domain adaptation techniques, DualCross can learn a monocular bird's-eye-view perception model that is robust to both types of shifts.

The authors' comprehensive evaluations and state-of-the-art results demonstrate the effectiveness of this approach, which represents an important step towards deploying self-driving systems in the real world. As the technology continues to mature, solutions like DualCross will be crucial for ensuring self-driving cars can operate safely and reliably in diverse environments and conditions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

UniMODE: Unified Monocular 3D Object Detection

Zhuoling Li, Xiaogang Xu, SerNam Lim, Hengshuang Zhao

Realizing unified monocular 3D object detection, including both indoor and outdoor scenes, holds great importance in applications like robot navigation. However, involving various scenarios of data to train models poses challenges due to their significantly different characteristics, e.g., diverse geometry properties and heterogeneous domain distributions. To address these challenges, we build a detector based on the bird's-eye-view (BEV) detection paradigm, where the explicit feature projection is beneficial to addressing the geometry learning ambiguity when employing multiple scenarios of data to train detectors. Then, we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by the aforementioned challenges. Moreover, we develop a sparse BEV feature projection strategy to reduce computational cost and a unified domain alignment method to handle heterogeneous domains. Combining these techniques, a unified detector UniMODE is derived, which surpasses the previous state-of-the-art on the challenging Omni3D dataset (a large-scale dataset including both indoor and outdoor scenes) by 4.9% AP_3D, revealing the first successful generalization of a BEV detector to unified 3D object detection.

5/10/2024

cs.CV

DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation

Yueru Luo, Shuguang Cui, Zhen Li

Accurate 3D lane estimation is crucial for ensuring safety in autonomous driving. However, prevailing monocular techniques suffer from depth loss and lighting variations, hampering accurate 3D lane detection. In contrast, LiDAR points offer geometric cues and enable precise localization. In this paper, we present DV-3DLane, a novel end-to-end Dual-View multi-modal 3D Lane detection framework that synergizes the strengths of both images and LiDAR points. We propose to learn multi-modal features in dual-view spaces, i.e., perspective view (PV) and bird's-eye-view (BEV), effectively leveraging the modal-specific information. To achieve this, we introduce three designs: 1) A bidirectional feature fusion strategy that integrates multi-modal features into each view space, exploiting their unique strengths. 2) A unified query generation approach that leverages lane-aware knowledge from both PV and BEV spaces to generate queries. 3) A 3D dual-view deformable attention mechanism, which aggregates discriminative features from both PV and BEV spaces into queries for accurate 3D lane detection. Extensive experiments on the public benchmark, OpenLane, demonstrate the efficacy and efficiency of DV-3DLane. It achieves state-of-the-art performance, with a remarkable 11.2 gain in F1 score and a substantial 53.5% reduction in errors. The code is available at https://github.com/JMoonr/dv-3dlane.

6/26/2024

cs.CV

Multimodal 3D Object Detection on Unseen Domains

Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel

LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.

4/19/2024

cs.CV

Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving

Shaoyuan Xie, Lingdong Kong, Wenwei Zhang, Jiawei Ren, Liang Pan, Kai Chen, Ziwei Liu

Recent advancements in bird's eye view (BEV) representations have shown remarkable promise for in-vehicle 3D perception. However, while these methods have achieved impressive results on standard benchmarks, their robustness in varied conditions remains insufficiently assessed. In this study, we present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms. This suite incorporates a diverse set of camera corruption types, each examined over three severity levels. Our benchmarks also consider the impact of complete sensor failures that occur when using multi-modal models. Through RoboBEV, we assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction. Our analyses reveal a noticeable correlation between the model's performance on in-distribution datasets and its resilience to out-of-distribution challenges. Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data. Furthermore, we observe that leveraging extensive temporal information significantly improves the model's robustness. Based on our observations, we design an effective robustness enhancement strategy based on the CLIP model. The insights from this study pave the way for the development of future BEV models that seamlessly combine accuracy with real-world robustness.

5/28/2024

cs.CV cs.RO