From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration

Read original: arXiv:2212.09298 - Published 4/30/2024 by Zekun Qian, Ruize Han, Wei Feng, Feifan Wang, Song Wang

Overview

Tackles the problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration
Aims to localize and orient both subjects and cameras in a unified BEV using only first-person view (FPV) RGB images
Proposes an end-to-end framework with three key components: view-transform subject detection, geometric camera registration, and spatial-appearance information aggregation
Collects a new large-scale synthetic dataset for evaluation

Plain English Explanation

This research addresses a challenging problem in computer vision: how to build a unified bird's eye view (BEV) of a multi-person scene from several first-person view (FPV) camera images, with no prior information about the camera positions or orientations.

The key idea is to develop an automated system that can take these FPV images, detect and localize the people in the scene, and then also figure out the positions and viewing directions of the cameras themselves. This allows all the information to be combined into a single BEV representation, which could be useful for applications like surveillance, autonomous driving, or sports analytics.

The researchers propose a three-part framework to solve this problem. First, they use a "view-transform subject detection module" to process each FPV image and extract the locations and orientations of the people. Next, they derive a "geometric transformation method" to estimate the position and viewing direction of each camera. Finally, they aggregate all this information into a single unified BEV representation.

The researchers also collected a new large-scale synthetic dataset to evaluate their framework, and the results show it can effectively solve this challenging multi-view registration problem.

Technical Explanation

The core of the proposed framework is a three-part approach. First, the "view-transform subject detection module" takes the FPV images and extracts the 2D locations and orientations of each person in a virtual BEV. This allows the subjects to be localized without requiring any prior camera calibration.
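
To make the idea of a per-camera virtual BEV concrete, the sketch below shows one possible way to represent the output of this step and why a camera pose is still needed afterwards. It is an illustration only, not the paper's data format: the names `BevSubject`, `CameraPose`, and `to_unified`, the units, and the axis conventions are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class BevSubject:
    """One detected person in a camera-centred virtual BEV (hypothetical layout)."""
    x: float      # assumed: metres along the camera frame's x axis
    y: float      # assumed: metres along the camera frame's y axis
    theta: float  # facing direction in radians, counter-clockwise from +x


@dataclass
class CameraPose:
    """Camera location and view direction in the unified BEV (hypothetical layout)."""
    x: float
    y: float
    yaw: float  # rotation from the camera's local frame to the unified frame


def to_unified(subject: BevSubject, cam: CameraPose) -> BevSubject:
    """Rotate a camera-local subject by the camera yaw, then shift by the camera position."""
    c, s = math.cos(cam.yaw), math.sin(cam.yaw)
    return BevSubject(
        x=cam.x + c * subject.x - s * subject.y,
        y=cam.y + s * subject.x + c * subject.y,
        theta=(subject.theta + cam.yaw) % (2 * math.pi),
    )
```

Each camera's detections live in that camera's own frame, so two views of the same person disagree until every camera's pose in the shared plane is known.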

Next, the researchers derive a geometric-transformation-based method to estimate the location and view direction of each camera in the unified BEV. The abstract calls this step camera registration.
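
The summary does not give the paper's derivation, so the following is a stand-in rather than the authors' method: a standard least-squares 2D rigid alignment. It assumes that the same subjects have already been matched between two camera-centred BEVs, which the actual problem does not provide up front. The function name `estimate_rigid_2d` and the point layout are hypothetical.

```python
import math


def estimate_rigid_2d(src, dst):
    """Return (angle, (tx, ty)) minimising sum |R(angle) @ src_i + t - dst_i|^2."""
    n = len(src)
    if n < 2 or n != len(dst):
        raise ValueError("need at least two matched point pairs")

    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n

    # Accumulate dot and cross products of the centred points.
    dot = 0.0
    cross = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return angle, (tx, ty)


# Subjects as seen in camera B's frame and the same subjects in camera A's frame.
in_b = [(0.0, 2.0), (1.0, 3.0), (-1.0, 4.0)]
in_a = [(1.0, 1.0), (0.0, 2.0), (-1.0, 0.0)]
yaw_b, pos_b = estimate_rigid_2d(in_b, in_a)
```

If camera A's frame is taken as the unified BEV, `pos_b` is where camera B sits in that plane and `yaw_b` is how far its view direction is rotated relative to camera A.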

Finally, the system uses spatial and appearance information to aggregate the subjects detected in the different views into the unified BEV. The result places both the subjects and the cameras, each with a location and an orientation, on one plane, so the full scene can be visualized and analyzed from an overhead perspective.
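
One simple way to combine spatial and appearance cues is a weighted matching cost, sketched below. This is an assumption-laden illustration, not the paper's aggregation module: the greedy strategy, the weight `alpha`, the threshold `max_cost`, and the function names are all hypothetical, and the appearance vectors stand in for whatever features a real system would extract.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def match_subjects(view_a, view_b, alpha=0.5, max_cost=1.0):
    """Greedy one-to-one matching of (position, feature) pairs from two views.

    Positions are assumed to be expressed in the same unified BEV already.
    """
    candidates = []
    for i, (pos_a, feat_a) in enumerate(view_a):
        for j, (pos_b, feat_b) in enumerate(view_b):
            spatial = math.dist(pos_a, pos_b)
            appearance = cosine_distance(feat_a, feat_b)
            cost = alpha * spatial + (1.0 - alpha) * appearance
            if cost <= max_cost:
                candidates.append((cost, i, j))

    # Accept the cheapest pairs first, never reusing a subject from either view.
    candidates.sort()
    used_a, used_b, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j))
    return matches


view_a = [((0.0, 0.0), [1.0, 0.0]), ((5.0, 5.0), [0.0, 1.0])]
view_b = [((5.1, 5.0), [0.0, 1.0]), ((0.1, 0.0), [1.0, 0.0])]
pairs = match_subjects(view_a, view_b)
```

Matched pairs can then be merged into a single subject in the unified BEV, for example by averaging their positions, while unmatched detections are kept as subjects seen by only one camera.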

The researchers evaluated their framework on a new large-scale synthetic dataset that they collected and describe as richly annotated. They report that the experimental results demonstrate the effectiveness of the approach on this multi-view registration problem.

Critical Analysis

The paper tackles an important and difficult problem in computer vision, with potential applications in areas like surveillance, autonomous driving, and sports analytics. The proposed framework is technically sophisticated, leveraging methods like view-transform subject detection and geometric camera registration.

However, the reliance on a synthetic dataset for evaluation is a potential limitation. While the dataset is described as large-scale, it may not fully capture the complexity and variability of real-world multi-person scenes. Validating the framework on real-world data could be an area for future research.

Additionally, the paper does not provide much insight into the runtime performance or computational efficiency of the framework. Achieving real-time performance would be important for many practical applications.

Overall, this research represents an interesting and technically impressive approach to the problem of multi-view camera and subject registration in BEV. While there are some potential limitations, the work demonstrates the value of developing robust computer vision techniques for complex scene understanding tasks.

Conclusion

This paper tackles the challenging problem of multi-view camera and subject registration in the bird's eye view (BEV), without requiring any prior camera calibration information. The proposed end-to-end framework uses a three-part approach to extract subject locations/orientations, estimate camera poses, and aggregate this data into a unified BEV representation.

The researchers' novel techniques, including view-transform subject detection and geometric camera registration, show promising results on a new large-scale synthetic dataset. This work has the potential to enable more advanced scene understanding capabilities for applications like surveillance, autonomous driving, and sports analytics. While there are some limitations to address, this research represents an important step forward in tackling this challenging computer vision problem.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration

Zekun Qian, Ruize Han, Wei Feng, Feifan Wang, Song Wang

We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. This is a very challenging problem since its only input is several RGB images from different first-person views (FPVs) for a multi-person scene, without the BEV image and the calibration of the FPVs, while the output is a unified plane with the localization and orientation of both the subjects and cameras in a BEV. We propose an end-to-end framework solving this problem, whose main idea can be divided into following parts: i) creating a view-transform subject detection module to transform the FPV to a virtual BEV including localization and orientation of each pedestrian, ii) deriving a geometric transformation based method to estimate camera localization and view direction, i.e., the camera registration in a unified BEV, iii) making use of spatial and appearance information to aggregate the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. The experimental results show the remarkable effectiveness of our proposed method.

4/30/2024

Improved Single Camera BEV Perception Using Multi-Camera Training

Daniel Busch, Ido Freeman, Richard Meyes, Tobias Meisen

Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks like trajectory prediction. In the past, this was accomplished through the use of a sophisticated sensor configuration that captured a surround view from multiple cameras. However, in large-scale production, cost efficiency is an optimization goal, so that using fewer cameras becomes more relevant. But the consequence of fewer input images correlates with a performance drop. This raises the problem of developing a BEV perception model that provides a sufficient performance on a low-cost sensor setup. Although, primarily relevant for inference time on production cars, this cost restriction is less problematic on a test vehicle during training. Therefore, the objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround view model reduced for single-camera inference. The approach includes three features, a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss for supervising the transition from six-camera inputs to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with six-camera surround view for single-camera inference resulting in reduced hallucination and better quality of the BEV map.

9/5/2024

DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences

Peidong Li, Wancheng Shen, Qihao Huang, Dixiao Cui

Camera-based Bird's-Eye-View (BEV) perception often struggles between adopting 3D-to-2D or 2D-to-3D view transformation (VT). The 3D-to-2D VT typically employs resource-intensive Transformer to establish robust correspondences between 3D and 2D features, while the 2D-to-3D VT utilizes the Lift-Splat-Shoot (LSS) pipeline for real-time application, potentially missing distant information. To address these limitations, we propose DualBEV, a unified framework that utilizes a shared feature transformation incorporating three probabilistic measurements for both strategies. By considering dual-view correspondences in one stage, DualBEV effectively bridges the gap between these strategies, harnessing their individual strengths. Our method achieves state-of-the-art performance without Transformer, delivering comparable efficiency to the LSS approach, with 55.2% mAP and 63.4% NDS on the nuScenes test set. Code is available at https://github.com/PeidongLi/DualBEV

9/16/2024

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han

Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.

9/4/2024