Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching

2405.13874

Published 5/24/2024 by Hongkai Chen, Zixin Luo, Yurun Tian, Xuyang Bai, Ziyu Wang, Lei Zhou, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin and 1 other

cs.CV

✅

Abstract

Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformer. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cross-view deformations. Secondly, we present selective fusion to merge local and global messages from cross attention. Apart from network structure, we also identify the importance of enforcing spatial smoothness in loss design, which has been omitted by previous works. Based on these augmentations, our network demonstrate strong matching capacity under different settings. The full version of our network achieves state-of-the-art performance among semi-dense matching methods at a similar cost to LoFTR, while the slim version reaches LoFTR baseline's performance with only 15% computation cost and 18% parameters.

Create account to get full access

Overview

This paper proposes several improvements to recent semi-dense matching methods, which aim to identify robust and accurate correspondences across images.
Key contributions include:
- Introducing affine-based local attention to model cross-view deformations.
- Presenting selective fusion to merge local and global messages from cross attention.
- Emphasizing the importance of enforcing spatial smoothness in loss design, which was omitted in previous work.

Plain English Explanation

Matching features between different images is a fundamental problem in computer vision. It enables various downstream tasks like object recognition and 3D reconstruction. Recent methods have used Transformer-based architectures to effectively fuse relevant information across views.

This paper builds on these Transformer-based approaches with a few key improvements:

Affine-based local attention: It can be challenging to match features when the views of the same object are distorted, for example, if one image is rotated or zoomed in compared to another. The authors introduce a new attention mechanism that can better model these types of affine transformations between the views.
Selective fusion: The paper presents a way to selectively combine the local and global information learned by the Transformer, rather than just blindly fusing everything together. This helps the network focus on the most relevant details for accurate matching.
Spatial smoothness: The authors also find that explicitly encouraging the network's outputs to be spatially smooth, by including this as a term in the training loss, leads to better performance. This was overlooked in previous work.

By incorporating these improvements, the final network demonstrates state-of-the-art matching performance, while also offering efficient "slim" versions that are much smaller and faster than the previous best model.

Technical Explanation

The core of the paper is a Transformer-based architecture for semi-dense image matching. The authors build upon the LoFTR framework, which uses a Transformer to fuse cross-view information and predict feature correspondences.

To improve upon LoFTR, the authors make the following key contributions:

Affine-based Local Attention: Rather than using the standard dot-product attention mechanism, the authors introduce an "affine-based local attention" module. This allows the network to better model affine transformations between the matched views, improving robustness to geometric distortions.
Selective Fusion: The Transformer in LoFTR concatenates local and global feature representations before the final prediction layer. Instead, the authors propose a "selective fusion" module that learns to dynamically combine these representations, allowing the network to focus on the most relevant information.
Spatial Smoothness: Previous Transformer-based matching methods did not explicitly enforce spatial smoothness in the output correspondences. The authors find that including a spatial smoothness term in the training loss leads to significant performance gains.

The authors evaluate their approach on several standard benchmarks for semi-dense image matching, including the HPatches, RobotCar, and Aachen Day-Night datasets. Their full model achieves state-of-the-art results, outperforming LoFTR while only using a similar amount of compute. They also present a "slim" version that reaches LoFTR's performance using only 15% of the computational cost and 18% of the parameters.

Critical Analysis

The authors provide a thorough analysis of their proposed improvements, clearly demonstrating their effectiveness through extensive experiments. The use of affine-based local attention and selective fusion are well-motivated and lead to tangible performance gains.

One potential limitation is that the paper does not delve into the interpretability of these architectural changes. While the quantitative results are impressive, it would be valuable to better understand how the affine-based attention and selective fusion mechanisms are actually improving the matching capabilities of the network.

Additionally, the authors focus solely on semi-dense matching, which is an important but somewhat narrow domain. It would be interesting to see how these techniques generalize to other computer vision tasks, such as image deblurring or 3D object detection.

Overall, this paper represents a solid contribution to the field of image matching, offering practical improvements that can be readily applied to real-world applications. The authors have struck a good balance between technical rigor and accessibility, making their work a valuable resource for both researchers and practitioners.

Conclusion

This paper presents several key enhancements to Transformer-based semi-dense image matching methods. The introduction of affine-based local attention, selective fusion, and explicit spatial smoothness in the loss function lead to state-of-the-art matching performance at a similar computational cost to the previous best model.

These improvements demonstrate the continued potential of Transformer architectures for computer vision tasks, particularly when combined with careful attention to network design and training objectives. The authors have made a compelling case for the benefits of their approach, and their work lays the groundwork for further advancements in this important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Are Semi-Dense Detector-Free Methods Good at Matching Local Features?

Matthieu Vilain, R'emi Giraud, Hugo Germain, Guillaume Bourmaud

Semi-dense detector-free approaches (SDF), such as LoFTR, are currently among the most popular image matching methods. While SDF methods are trained to establish correspondences between two images, their performances are almost exclusively evaluated using relative pose estimation metrics. Thus, the link between their ability to establish correspondences and the quality of the resulting estimated pose has thus far received little attention. This paper is a first attempt to study this link. We start with proposing a novel structured attention-based image matching architecture (SAM). It allows us to show a counter-intuitive result on two datasets (MegaDepth and HPatches): on the one hand SAM either outperforms or is on par with SDF methods in terms of pose/homography estimation metrics, but on the other hand SDF approaches are significantly better than SAM in terms of matching accuracy. We then propose to limit the computation of the matching accuracy to textured regions, and show that in this case SAM often surpasses SDF methods. Our findings highlight a strong correlation between the ability to establish accurate correspondences in textured regions and the accuracy of the resulting estimated pose/homography. Our code will be made available.

6/4/2024

cs.CV cs.AI

Fusion of regional and sparse attention in Vision Transformers

Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara

Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions, in contrast to the global attention employed in the original ViT. Regional attention restricts pixel interactions within specific regions, while sparse attention disperses them across sparse grids. These differing approaches pose a challenge between maintaining hierarchical relationships vs. capturing a global context. In this study, drawing inspiration from atrous convolution, we propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information while preserving hierarchical structures. Based on this, we introduce a versatile, hybrid vision transformer backbone called ACC-ViT, tailored for standard vision tasks. Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42% while requiring 8.4% fewer parameters.

6/14/2024

cs.CV

✨

Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence

Sunghwan Hong, Seokju Cho, Seungryong Kim, Stephen Lin

This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks. In the context of dense matching, many works benefit from one of two forms of aggregation: feature aggregation, which pertains to the alignment of similar features, or cost aggregation, a procedure aimed at instilling coherence in the flow estimates across neighboring pixels. In this work, we first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes. We then introduce a simple yet effective architecture that harnesses self- and cross-attention mechanisms to show that our approach unifies feature aggregation and cost aggregation and effectively harnesses the strengths of both techniques. Within the proposed attention layers, the features and cost volume both complement each other, and the attention layers are interleaved through a coarse-to-fine design to further promote accurate correspondence estimation. Finally at inference, our network produces multi-scale predictions, computes their confidence scores, and selects the most confident flow for final prediction. Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.

4/23/2024

cs.CV

A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images

Wang Zhang, Tingting Li, Yuntian Zhang, Gensheng Pei, Xiruo Jiang, Yazhou Yao

Matching visible and near-infrared (NIR) images remains a significant challenge in remote sensing image fusion. The nonlinear radiometric differences between heterogeneous remote sensing images make the image matching task even more difficult. Deep learning has gained substantial attention in computer vision tasks in recent years. However, many methods rely on supervised learning and necessitate large amounts of annotated data. Nevertheless, annotated data is frequently limited in the field of remote sensing image matching. To address this challenge, this paper proposes a novel keypoint descriptor approach that obtains robust feature descriptors via a self-supervised matching network. A light-weight transformer network, termed as LTFormer, is designed to generate deep-level feature descriptors. Furthermore, we implement an innovative triplet loss function, LT Loss, to enhance the matching performance further. Our approach outperforms conventional hand-crafted local feature descriptors and proves equally competitive compared to state-of-the-art deep learning-based methods, even amidst the shortage of annotated data.

5/1/2024

cs.CV cs.MM