Learning Structure-from-Motion with Graph Attention Networks

Read original: arXiv:2308.15984 - Published 5/21/2024 by Lucas Brynte, José Pedro Iglesias, Carl Olsson, Fredrik Kahl


Overview

The paper tackles the problem of learning Structure-from-Motion (SfM) using graph attention networks.
SfM is a classic computer vision problem that is typically solved through iterative minimization of reprojection errors, known as Bundle Adjustment (BA), starting from a good initialization.
Conventional methods rely on a sequence of sub-problems (e.g., pairwise pose estimation, pose averaging, or triangulation) to obtain a good initialization for BA.
In this work, the authors replace these sub-problems with a learned model that takes 2D keypoints detected across multiple views and outputs the corresponding camera poses and 3D keypoint coordinates.
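To make one of the conventional sub-problems concrete, here is a minimal sketch of two-view triangulation using the midpoint method. It is a textbook illustration, not code from the paper: the function names and the ray-based inputs (camera centres and viewing directions) are assumptions chosen for this example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Return the midpoint of the shortest segment between two viewing rays.

    Each ray is c + s * d, where c is a camera centre and d is the direction
    through the observed keypoint (hypothetical inputs for this sketch).
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    s1 = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    s2 = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [ci + s1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + s2 * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]


# Two cameras one unit apart, both looking at the point (1, 2, 5).
point = triangulate_midpoint([0, 0, 0], [1, 2, 5], [1, 0, 0], [0, 2, 5])
```

With noisy keypoints the two rays no longer intersect, which is one reason conventional pipelines follow such initial estimates with Bundle Adjustment.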

Plain English Explanation

The paper describes a new approach to solving the Structure-from-Motion (SfM) problem, which is a fundamental challenge in computer vision. SfM involves reconstructing the 3D structure of a scene and the positions of the cameras that captured it, based on a collection of 2D images.

Traditionally, SfM has been solved using a two-step process. First, a series of sub-problems, such as estimating the relative poses of pairs of cameras or triangulating the positions of 3D points, are used to obtain an initial estimate of the scene and camera parameters. Then, this initial estimate is refined through an iterative optimization process called Bundle Adjustment (BA), which minimizes the reprojection errors between the observed 2D points and their corresponding 3D reconstructions.
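The reprojection error that Bundle Adjustment minimizes can be sketched as follows. This is an illustrative example under simplifying assumptions (an ideal pinhole camera with a single focal length, no lens distortion, principal point at the origin); the function names and data layout are hypothetical and not taken from the paper.

```python
def project(R, t, f, X):
    """Project 3D point X into a pinhole camera.

    R is a 3x3 rotation (list of rows), t a translation, f the focal length.
    """
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (f * Xc[0] / Xc[2], f * Xc[1] / Xc[2])


def total_reprojection_error(cameras, points, observations):
    """Sum of squared pixel residuals over all (camera, point) observations.

    cameras:      {camera id: (R, t, f)}
    points:       {point id: [X, Y, Z]}
    observations: {(camera id, point id): (u, v)} detected 2D keypoints
    """
    total = 0.0
    for (cam_id, pt_id), (u, v) in observations.items():
        R, t, f = cameras[cam_id]
        pu, pv = project(R, t, f, points[pt_id])
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cameras = {0: (identity, [0, 0, 0], 100.0)}
points = {0: [1.0, 2.0, 4.0]}          # projects to (25.0, 50.0)
observations = {(0, 0): (26.0, 50.0)}  # keypoint detected one pixel off
error = total_reprojection_error(cameras, points, observations)  # 1.0
```

Bundle Adjustment adjusts the camera and point parameters jointly to drive this sum down, which is why it needs a reasonable starting estimate.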

In this work, the authors propose a novel approach that bypasses the need for these sub-problems. Instead, they train a machine learning model, based on graph neural networks, to directly predict the camera poses and 3D point coordinates from the 2D keypoints detected in the input images. This allows for faster inference and potentially more accurate reconstructions, as the model can learn SfM-specific primitives and patterns from data.

The experimental results show that the proposed approach outperforms other learning-based methods and even challenges the performance of the popular COLMAP SfM pipeline, while having a lower computational runtime.

Technical Explanation

The key idea of the paper is to replace the conventional sub-problems used in SfM (e.g., pairwise pose estimation, pose averaging, or triangulation) with a learned model that can directly predict the camera poses and 3D point coordinates from the 2D keypoints.

The authors leverage graph attention networks (GATs) to design their model, as these networks are well-suited for learning SfM-specific primitives and relationships between the 2D keypoints, camera poses, and 3D point coordinates. The input to the model is a graph where the nodes represent the 2D keypoints, and the edges encode the relationships between them across multiple views.
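As a rough illustration of what a graph attention update does, the sketch below implements one generic single-head attention step in plain Python. It is not the authors' architecture: the toy graph (three observations of one keypoint, with 2D coordinates as node features), the weights `W` and `a`, and all function names are assumptions made for this example.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, neighbors, W, a):
    """One single-head graph attention update.

    features:  {node: feature vector}
    neighbors: {node: non-empty list of neighbours (include the node itself
               to get a self-loop)}
    W:         weight matrix shared by all nodes (out_dim x in_dim)
    a:         attention vector of length 2 * out_dim
    """
    z = {n: matvec(W, h) for n, h in features.items()}
    out = {}
    for i, nbrs in neighbors.items():
        # Unnormalised score for each edge (i, j) from the concatenated features.
        scores = [leaky_relu(dot_a(a, z[i] + z[j])) for j in nbrs]
        # Softmax over the neighbourhood (shifted by the max for stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        norm = sum(exps)
        alphas = [e / norm for e in exps]
        # New feature: attention-weighted sum of transformed neighbour features.
        out[i] = [sum(al * z[j][k] for al, j in zip(alphas, nbrs))
                  for k in range(len(z[i]))]
    return out


def dot_a(a, v):
    return sum(x * y for x, y in zip(a, v))


# Toy graph: three observations of the same keypoint in three views.
features = {0: [0.0, 0.0], 1: [3.0, 0.0], 2: [0.0, 3.0]}
neighbors = {n: [0, 1, 2] for n in features}
W = [[1.0, 0.0], [0.0, 1.0]]   # identity transform, for readability
a = [0.0, 0.0, 0.0, 0.0]       # zero attention vector gives uniform weights
updated = gat_layer(features, neighbors, W, a)  # every node becomes the neighbourhood mean
```

In a trained network `W` and `a` are learned, so the attention weights decide how much each neighbouring node contributes to the update instead of averaging uniformly.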

The model is trained end-to-end to predict the camera poses and 3D point coordinates from the input 2D keypoints. This allows the model to learn the underlying patterns and constraints of the SfM problem, rather than relying on a sequence of sub-problems with potential error propagation.

The experimental results show that the proposed model outperforms competing learning-based SfM methods and challenges the popular COLMAP SfM pipeline while having a lower runtime.

Critical Analysis

The paper presents a promising approach to learning SfM, but it is important to consider some potential caveats and limitations.

One key question is how the model would perform on more complex or challenging scenes, beyond the datasets used in the experiments. The authors mention that their approach assumes a sufficient number of detected keypoints, which may not always be the case in real-world scenarios with occlusions, textureless regions, or other challenges.

Additionally, the paper does not address the handling of dynamic or non-rigid scenes, which can be a significant challenge in SfM. It would be interesting to see how the proposed model could be extended to handle these more complex cases.

Finally, although the authors report results that outperform existing learning-based methods and challenge COLMAP, it remains unclear how the model would compare with highly optimized traditional SfM pipelines in accuracy and robustness across a wider range of datasets and scenarios.

Overall, the paper presents an intriguing and potentially impactful approach to learning SfM, but further research and evaluation would be needed to fully assess its capabilities and limitations.

Conclusion

This paper introduces a novel approach to learning Structure-from-Motion (SfM) using graph attention networks. By replacing the traditional sub-problems with a learned model that directly predicts camera poses and 3D point coordinates from 2D keypoints, the authors report better results than existing learning-based methods and results that challenge the popular COLMAP SfM pipeline at a lower runtime.

The key innovation is the use of graph neural networks to capture the intricate relationships between the 2D keypoints, camera poses, and 3D structure, allowing the model to learn SfM-specific primitives and constraints. This approach shows promise for accelerating and potentially improving the accuracy of 3D scene reconstruction, with potential applications in areas like robotics, augmented reality, and 3D mapping.

While the paper presents compelling results, it also highlights the need for further research to address the model's performance on more complex scenes and its robustness to real-world challenges. Exploring the model's extensibility to dynamic or non-rigid scenes would also be an interesting direction for future work.

Overall, this paper contributes a valuable step forward in the quest to develop more efficient and accurate 3D reconstruction systems, with the potential to have a significant impact on a wide range of computer vision and spatial computing applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Learning Structure-from-Motion with Graph Attention Networks

Lucas Brynte, José Pedro Iglesias, Carl Olsson, Fredrik Kahl

In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. In order to obtain a good enough initialization to BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging or triangulation) which provide an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views, and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods, and challenges COLMAP while having lower runtime. Our code is available at https://github.com/lucasbrynte/gasfm/.

5/21/2024

Global Structure-from-Motion Revisited

Linfei Pan, Dániel Baráth, Marc Pollefeys, Johannes L. Schönberger

Recovering 3D structure and camera motion from images has been a long-standing focus of computer vision research and is known as Structure-from-Motion (SfM). Solutions to this problem are categorized into incremental and global approaches. Until now, the most popular systems follow the incremental paradigm due to its superior accuracy and robustness, while global approaches are drastically more scalable and efficient. With this work, we revisit the problem of global SfM and propose GLOMAP as a new general-purpose system that outperforms the state of the art in global SfM. In terms of accuracy and robustness, we achieve results on-par or superior to COLMAP, the most widely used incremental SfM, while being orders of magnitude faster. We share our system as an open-source implementation at https://github.com/colmap/glomap.

7/30/2024

MCGMapper: Light-Weight Incremental Structure from Motion and Visual Localization With Planar Markers and Camera Groups

Yusen Xie, Zhenmin Huang, Kai Chen, Lei Zhu, Jun Ma

Structure from Motion (SfM) and visual localization in indoor texture-less scenes and industrial scenarios present prevalent yet challenging research topics. Existing SfM methods designed for natural scenes typically yield low accuracy or map-building failures due to insufficient robust feature extraction in such settings. Visual markers, with their artificially designed features, can effectively address these issues. Nonetheless, existing marker-assisted SfM methods encounter problems like slow running speed and difficulties in convergence; and also, they are governed by the strong assumption of unique marker size. In this paper, we propose a novel SfM framework that utilizes planar markers and multiple cameras with known extrinsics to capture the surrounding environment and reconstruct the marker map. In our algorithm, the initial poses of markers and cameras are calculated with Perspective-n-Points (PnP) in the front-end, while bundle adjustment methods customized for markers and camera groups are designed in the back-end to optimize the 6-DOF pose directly. Our algorithm facilitates the reconstruction of large scenes with different marker sizes, and its accuracy and speed of map building are shown to surpass existing methods. Our approach is suitable for a wide range of scenarios, including laboratories, basements, warehouses, and other industrial settings. Furthermore, we incorporate representative scenarios into simulations and also supply our datasets with pose labels to address the scarcity of quantitative ground-truth datasets in this research field. The datasets and source code are available on GitHub.

5/28/2024

Revisit Self-supervised Depth Estimation with Local Structure-from-Motion

Shengjie Zhu, Xiaoming Liu

Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior works of self-supervision backpropagate losses defined within immediate neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme by performing local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, however, without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show self-supervision within 5 frames already benefits SoTA supervised depth and correspondence models. The project page is held in the link (https://shngjz.github.io/SSfM.github.io/).

8/9/2024