Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion

2303.12017

Published 4/9/2024 by Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, Limin Wang

Abstract

In this paper, we study the problem of jointly estimating the optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an "early-fusion" or "late-fusion" manner. Such one-size-fits-all approaches suffer from a dilemma of failing to fully utilize the characteristic of each modality or to maximize the inter-modality complementarity. To address the problem, we propose a novel end-to-end framework, which consists of 2D and 3D branches with multiple bidirectional fusion connections between them in specific layers. Different from previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other one based on the recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point-error from the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with much fewer parameters. Besides, our methods have strong generalization performance and the ability to handle non-rigid motion. Code is available at https://github.com/MCG-NJU/CamLiFlow.

Overview

This paper explores the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data.
Previous methods either used a complex pipeline that split the task into independent stages, or fused 2D and 3D information in a one-size-fits-all approach.
The authors propose a novel end-to-end framework that combines 2D and 3D branches with bidirectional fusion connections.
The framework uses a point-based 3D branch to preserve the geometric structure of point clouds, and a learnable fusion module to combine dense image features and sparse point features.
Two types of bidirectional fusion pipelines are presented: one based on a pyramidal coarse-to-fine architecture, and the other on recurrent all-pairs field transforms.

Plain English Explanation

The paper addresses the challenge of estimating both 2D optical flow (the movement of pixels in an image) and 3D scene flow (the movement of 3D points in a scene) from data that combines 2D images and 3D point clouds. Previous methods either used a complex process that treated these two tasks separately, or tried to combine the 2D and 3D data in a one-size-fits-all way, which didn't fully leverage the strengths of each type of data.
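
The relationship between the two quantities can be made concrete with a small sketch. Assuming a pinhole camera (the focal length f and principal point cx, cy below are hypothetical values, not taken from the paper), the optical flow of a point is the difference between the image projections of its 3D position before and after its scene flow is applied.

```python
def project(point, f, cx, cy):
    """Pinhole projection of a 3D camera-frame point (x, y, z) to a pixel (u, v)."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)


def induced_optical_flow(point, scene_flow, f, cx, cy):
    """2D pixel displacement caused by moving `point` by the 3D vector `scene_flow`."""
    u0, v0 = project(point, f, cx, cy)
    moved = tuple(p + d for p, d in zip(point, scene_flow))
    u1, v1 = project(moved, f, cx, cy)
    return (u1 - u0, v1 - v0)


# Hypothetical example: a point 10 m in front of the camera moves 0.5 m sideways.
flow = induced_optical_flow(
    point=(1.0, 0.5, 10.0),
    scene_flow=(0.5, 0.0, 0.0),
    f=1000.0, cx=640.0, cy=360.0,  # assumed intrinsics, for illustration only
)
print(flow)
```

The same sideways motion at twice the depth would produce half the pixel displacement, which is one reason depth from the point cloud is useful when reasoning about image motion.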

The authors propose a new approach that directly combines the 2D and 3D data in an end-to-end framework. This framework has separate branches for processing the 2D and 3D data, but it also has special connections that allow the information from each branch to influence the other. For the 3D data, they use a technique that preserves the underlying geometric structure of the 3D point cloud, rather than just treating it like an unstructured set of points.

To fuse the dense information from the 2D images and the sparse information from the 3D point clouds, the authors developed a new learnable fusion module. They then instantiated two different versions of this overall framework, one based on a pyramidal coarse-to-fine architecture and the other on recurrent all-pairs field transforms. Both of these new models significantly outperformed previous methods on standard benchmarks for optical flow and scene flow estimation.

Technical Explanation

The paper presents a novel end-to-end framework for jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employed a complex pipeline that split the task into independent stages, or used a one-size-fits-all approach to fuse the 2D and 3D information in an "early-fusion" or "late-fusion" manner.

To address these limitations, the authors propose a framework with separate 2D and 3D branches that are connected through multiple bidirectional fusion connections. For the 3D branch, they use a point-based approach to preserve the geometric structure of the point clouds, rather than treating them as unstructured sets of points. To fuse the dense 2D image features and sparse 3D point features, they introduce a learnable operator called the bidirectional camera-LiDAR fusion module (Bi-CLFM).
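
The idea of exchanging features in both directions can be illustrated with a toy sketch. This is not the authors' Bi-CLFM: the learnable operator is replaced here by fixed, hand-written operations chosen for illustration (bilinear sampling for the image-to-point direction, nearest-pixel averaging for the point-to-image direction, and plain concatenation as the fusion step). All function names are hypothetical, and each point is assumed to come with its projected pixel coordinates (u, v).

```python
import math


def bilinear_sample(feat, u, v):
    """Sample an H x W x C feature map (nested lists) at a continuous pixel (u, v)."""
    h, w = len(feat), len(feat[0])
    u = min(max(u, 0.0), w - 1.0)  # clamp to the image bounds
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(u), int(v)
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    au, av = u - u0, v - v0
    return [
        (1 - av) * ((1 - au) * feat[v0][u0][c] + au * feat[v0][u1][c])
        + av * ((1 - au) * feat[v1][u0][c] + au * feat[v1][u1][c])
        for c in range(len(feat[0][0]))
    ]


def fuse_2d_to_3d(image_feat, point_feats, pixels):
    """Dense to sparse: append the image feature sampled at each point's projection."""
    return [pf + bilinear_sample(image_feat, u, v)
            for pf, (u, v) in zip(point_feats, pixels)]


def fuse_3d_to_2d(image_feat, point_feats, pixels):
    """Sparse to dense: average point features per nearest pixel, then concatenate.

    Pixels that receive no point get zeros for the point channels.
    """
    h, w = len(image_feat), len(image_feat[0])
    c3 = len(point_feats[0])
    sums = [[[0.0] * c3 for _ in range(w)] for _ in range(h)]
    counts = [[0] * w for _ in range(h)]
    for pf, (u, v) in zip(point_feats, pixels):
        col, row = math.floor(u + 0.5), math.floor(v + 0.5)
        if 0 <= row < h and 0 <= col < w:
            counts[row][col] += 1
            for c in range(c3):
                sums[row][col][c] += pf[c]
    return [
        [image_feat[r][q] + [s / max(counts[r][q], 1) for s in sums[r][q]]
         for q in range(w)]
        for r in range(h)
    ]


# Toy data: a 2 x 2 image with one channel and two points with one channel each.
image_feat = [[[0.0], [1.0]], [[2.0], [3.0]]]
point_feats = [[10.0], [20.0]]
pixels = [(0.5, 0.5), (1.0, 0.0)]

fused_points = fuse_2d_to_3d(image_feat, point_feats, pixels)
fused_image = fuse_3d_to_2d(image_feat, point_feats, pixels)
```

In this sketch most pixels would receive zeros in a realistic setting because LiDAR points are sparse relative to the image grid, which hints at why a learnable operator may be preferable to a fixed scatter.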

The authors instantiate two types of this bidirectional fusion pipeline. The first, called CamLiPWC, is based on a pyramidal coarse-to-fine architecture. The second, called CamLiRAFT, is based on recurrent all-pairs field transforms. Both models significantly outperform existing methods on the FlyingThings3D and KITTI Scene Flow benchmarks, with CamLiRAFT achieving state-of-the-art results on KITTI while using fewer parameters.
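
The abstract reports its FlyingThings3D result as a reduction in 3D end-point error. As a minimal sketch, assuming the standard definition of that metric (the mean Euclidean distance between predicted and ground-truth flow vectors), it and a relative reduction can be computed as follows. The flow vectors and error values below are made up for illustration and are not results from the paper.

```python
import math


def epe3d(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 3D flow vectors."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline_error, new_error):
    """Fraction by which `new_error` improves on `baseline_error`."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical flow vectors for two points.
pred = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error = epe3d(pred, gt)

# Hypothetical errors: halving the error is a 50% reduction.
reduction = relative_reduction(baseline_error=2.0, new_error=1.0)
```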

The key innovations in this work are the end-to-end framework with bidirectional fusion, the point-based 3D branch, and the learnable Bi-CLFM fusion module. These allow the model to better leverage the complementary strengths of 2D and 3D data for joint optical flow and scene flow estimation, leading to substantial performance improvements.

Critical Analysis

The paper presents a compelling approach to the challenging problem of jointly estimating optical flow and scene flow from 2D-3D data. The authors' key insight of using bidirectional fusion connections between the 2D and 3D branches, rather than a one-size-fits-all fusion strategy, is well-justified and leads to strong empirical results.

One potential limitation is the reliance on synchronized 2D-3D data, which may not always be available in real-world scenarios. It would be interesting to see how the framework could be adapted to handle asynchronous or partial 2D-3D data. Additionally, the paper does not provide much analysis on the computational cost or inference speed of the proposed models, which are important practical considerations.

While the performance gains on standard benchmarks are impressive, the paper could be strengthened by a more in-depth discussion of the qualitative differences between the outputs of the CamLiPWC and CamLiRAFT models. Additionally, comparing the models' generalization abilities on a wider range of scenes and motion types would help demonstrate the robustness of the approach.

Overall, this is a strong technical contribution that advances the state-of-the-art in joint optical flow and scene flow estimation. The authors' novel ideas around bidirectional fusion and point-based 3D processing are worthy of further exploration and refinement.

Conclusion

This paper presents a novel end-to-end framework for jointly estimating optical flow and scene flow from synchronized 2D and 3D data. The key innovations are the use of bidirectional fusion connections between 2D and 3D branches, a point-based 3D branch to preserve geometric structure, and a learnable fusion module to combine dense image features and sparse point features.

The authors instantiate two versions of this framework, CamLiPWC and CamLiRAFT, which significantly outperform previous methods on standard benchmarks. CamLiRAFT in particular achieves state-of-the-art results on the KITTI Scene Flow challenge while using fewer parameters.

This work demonstrates the value of carefully designed fusion strategies for leveraging the complementary strengths of 2D and 3D data. The insights and techniques developed here could have broader applications in fields that require reasoning about dynamic 3D scenes, such as autonomous driving, human motion analysis, and scene understanding. Further research could explore extending the framework to handle asynchronous or partial 2D-3D data, as well as its broader applicability to long-term scene flow estimation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

Yang Song, Lin Wang

3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of LiDAR features or fail to preserve the original spatial structure of LiDAR and the semantic density of camera features simultaneously due to the modality gap. To address these issues, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion, that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features, and to adaptively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion consisting of a Voxel Enhancement Module (VEM) to enhance the semantics of voxel features from 2D camera features and an Image Enhancement Module (IEM) to enhance the spatial characteristics of camera features from 3D voxel features. Both VEM and IEM are bidirectionally updated to effectively reduce the modality gap. We then introduce Unified Fusion to adaptively weight and select features from the enhanced LiDAR and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.

6/28/2024

cs.CV cs.AI

Camera Motion Estimation from RGB-D-Inertial Scene Flow

Samuel Cerezo, Javier Civera

In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multi-frame optimization or to marginalize older data, thus effectively utilizing past measurements. To assess the performance of our method, we conducted evaluations using both synthetic data from the ICL-NUIM dataset and real data sequences from the OpenLORIS-Scene dataset. Our results show that the fusion of these two sensors enhances the accuracy of camera motion estimation when compared to using only visual data.

4/29/2024

cs.CV

Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers

James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, Romain Mueller

Combining complementary sensor modalities is crucial to providing robust perception for safety-critical robotics applications such as autonomous driving (AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on monocular depth estimation which is a notoriously difficult task compared to using depth information from the lidar directly. Here, we find that this approach does not leverage depth as expected and show that naively improving depth estimation does not lead to improvements in object detection performance. Strikingly, we also find that removing depth estimation altogether does not degrade object detection performance substantially, suggesting that relying on monocular depth could be an unnecessary architectural bottleneck during camera-lidar fusion. In this work, we introduce a novel fusion method that bypasses monocular depth estimation altogether and instead selects and fuses camera and lidar features in a bird's-eye-view grid using a simple attention mechanism. We show that our model can modulate its use of camera features based on the availability of lidar features and that it yields better 3D object detection on the nuScenes dataset than baselines relying on monocular depth estimation.

5/22/2024

cs.CV cs.LG

Let It Flow: Simultaneous Optimization of 3D Flow and Object Clustering

Patrik Vacek, David Hurych, Tomáš Svoboda, Karel Zimmermann

We study the problem of self-supervised 3D scene flow estimation from real large-scale raw point cloud sequences, which is crucial to various tasks like trajectory prediction or instance segmentation. In the absence of ground truth scene flow labels, contemporary approaches concentrate on deducing optimizing flow across sequential pairs of point clouds by incorporating structure based regularization on flow and object rigidity. The rigid objects are estimated by a variety of 3D spatial clustering methods. While state-of-the-art methods successfully capture overall scene motion using the Neural Prior structure, they encounter challenges in discerning multi-object motions. We identified the structural constraints and the use of large and strict rigid clusters as the main pitfall of the current approaches and we propose a novel clustering approach that allows for combination of overlapping soft clusters as well as non-overlapping rigid clusters representation. Flow is then jointly estimated with progressively growing non-overlapping rigid clusters together with fixed size overlapping soft clusters. We evaluate our method on multiple datasets with LiDAR point clouds, demonstrating the superior performance over the self-supervised baselines reaching new state of the art results. Our method especially excels in resolving flow in complicated dynamic scenes with multiple independently moving objects close to each other which includes pedestrians, cyclists and other vulnerable road users. Our codes are publicly available on https://github.com/ctu-vras/let-it-flow.

5/21/2024

cs.CV