DarSwin: Distortion Aware Radial Swin Transformer

Read original: arXiv:2304.09691 - Published 7/25/2024 by Akshaya Athwale, Arman Afrasiyabi, Justin Lague, Ichrak Shili, Ola Ahmad, Jean-François Lalonde


Overview

Wide-angle lenses are commonly used in perception tasks that require a large field of view.
These lenses produce significant distortions, making conventional models that ignore the distortion effects unable to adapt to wide-angle images.
The paper presents a novel transformer-based model called DarSwin that automatically adapts to the distortion produced by wide-angle lenses.

Plain English Explanation

The paper describes a new deep learning model called DarSwin that can work with images captured using wide-angle lenses. Wide-angle lenses are often used in applications that need to see a large area, like security cameras or augmented reality. However, these lenses can cause significant distortion in the image, which makes it hard for existing AI models to understand what's in the image.

The key idea behind DarSwin is that it takes the physics of how wide-angle lenses distort images into account. It uses a special transformer architecture that divides the image into radial patches and encodes the distortion information directly. This allows DarSwin to adapt to the distortion and perform well on tasks like object detection or depth estimation, even when the distortion level changes.
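To make the idea of a lens "distortion profile" concrete, here is a minimal sketch comparing two textbook projection models: the pinhole (perspective) model, where the image radius grows as f·tan(θ), and the equidistant fisheye model, where it grows as f·θ. These are illustrative assumptions, not necessarily the lens model used in the paper, and the focal length and function names are hypothetical.

```python
import math


def perspective_radius(theta, focal):
    """Pinhole projection: image radius r = f * tan(theta)."""
    return focal * math.tan(theta)


def equidistant_radius(theta, focal):
    """Equidistant fisheye projection: image radius r = f * theta."""
    return focal * theta


focal = 100.0  # hypothetical focal length, in pixels
for degrees in (10, 30, 60, 80):
    theta = math.radians(degrees)
    print(
        degrees,
        round(perspective_radius(theta, focal), 1),
        round(equidistant_radius(theta, focal), 1),
    )
```

As the incident angle grows, the fisheye radius falls further behind the perspective radius, so content near the image border is increasingly compressed. A model that treats every pixel neighbourhood the same way has no means of accounting for that.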

The paper also shows that DarSwin can be combined with a "self-calibration" network that can automatically estimate the distortion of the lens from the image itself. This means DarSwin can work without any prior information about the lens, making it more flexible and easier to use in real-world applications.

Technical Explanation

The DarSwin architecture takes inspiration from the Swin Transformer but incorporates several key modifications to handle the distortion introduced by wide-angle lenses:

Radial Patch Partitioning: Instead of the standard grid-based patch partitioning, DarSwin divides the input image into radial patches following the distortion profile of the wide-angle lens.
Distortion-based Sampling: To create the token embeddings, DarSwin uses a distortion-based sampling technique that samples the image at points corresponding to the undistorted grid locations.
Angular Position Encoding: DarSwin incorporates an angular position encoding to capture the radial structure of the patches during the patch merging process.
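The first two modifications can be illustrated with a short sketch. It is not the authors' implementation: it assumes an equidistant distortion profile, places a single sample point at the centre of each radial patch, and uses nearest-neighbour lookup. The function names (`radial_patch_centers`, `sample_nearest`) and all parameters are hypothetical.

```python
import math


def equidistant(theta):
    """Assumed distortion profile: image radius proportional to incident angle."""
    return theta


def radial_patch_centers(height, width, n_radial, n_azimuth, max_theta, profile):
    """Return an n_radial x n_azimuth grid of (x, y) patch centres in pixels.

    The incident angle is split into equal-width bins; `profile` then maps
    each bin centre to an image radius, so rings follow the lens distortion.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)
    scale = r_max / profile(max_theta)  # max_theta lands on the image border
    centers = []
    for i in range(n_radial):
        theta = (i + 0.5) * max_theta / n_radial
        radius = profile(theta) * scale
        ring = []
        for j in range(n_azimuth):
            phi = (j + 0.5) * 2.0 * math.pi / n_azimuth
            ring.append((cx + radius * math.cos(phi), cy + radius * math.sin(phi)))
        centers.append(ring)
    return centers


def sample_nearest(image, centers):
    """Read the pixel nearest to each patch centre (image is a list of rows)."""
    h, w = len(image), len(image[0])
    tokens = []
    for ring in centers:
        row = []
        for x, y in ring:
            xi = min(max(int(round(x)), 0), w - 1)
            yi = min(max(int(round(y)), 0), h - 1)
            row.append(image[yi][xi])
        tokens.append(row)
    return tokens


# Toy 64x64 "image" whose pixel value encodes its own position.
image = [[y * 64 + x for x in range(64)] for y in range(64)]
centers = radial_patch_centers(64, 64, 4, 8, math.pi / 2, equidistant)
tokens = sample_nearest(image, centers)
print(len(tokens), len(tokens[0]))
```

Passing a different `profile` function changes where the rings fall in the image while the token grid keeps the same shape. That is the property that lets one architecture consume images from lenses with different distortion.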

The paper shows that DarSwin outperforms other baselines on various datasets, especially when trained and tested on a range of distortion levels, including out-of-distribution distortions.

Additionally, the authors present DarSwin-Unet, which extends the DarSwin architecture to an encoder-decoder setup suitable for pixel-level tasks like depth estimation. DarSwin-Unet demonstrates zero-shot adaptation to unseen distortions of different wide-angle lenses.

Critical Analysis

The paper provides a compelling solution to the problem of adapting deep learning models to work with images captured using wide-angle lenses. The authors' approach of incorporating the physical characteristics of the lens distortion directly into the model architecture is a novel and promising direction.

One potential limitation is the requirement of knowing the radial distortion profile of the lens in advance. While the authors show that this can be addressed by combining DarSwin with a self-calibration network, the performance of this combined system could be further explored.

Additionally, the paper focuses on evaluating the models on synthetic distortions, and it would be interesting to see how they perform on real-world wide-angle images captured using various lenses. Extending the evaluation to more diverse real-world datasets could provide additional insights.

Conclusion

The DarSwin and DarSwin-Unet models presented in this paper offer a promising solution for adapting deep learning models to work with wide-angle lenses. By explicitly incorporating the physical characteristics of the lens distortion into the model architecture, these approaches demonstrate significant improvements over conventional transformer-based models, especially when dealing with a range of distortion levels.

The ability to combine DarSwin with a self-calibration network, enabling a completely uncalibrated pipeline, is a particularly noteworthy feature that enhances the practical applicability of this research. As wide-angle lenses continue to be widely used in various perception tasks, the insights and techniques presented in this paper could have a substantial impact on the field.



Related Papers


DarSwin: Distortion Aware Radial Swin Transformer

Akshaya Athwale, Arman Afrasiyabi, Justin Lague, Ichrak Shili, Ola Ahmad, Jean-François Lalonde

Wide-angle lenses are commonly used in perception tasks requiring a large field of view. Unfortunately, these lenses produce significant distortions, making conventional models that ignore the distortion effects unable to adapt to wide-angle images. In this paper, we present a novel transformer-based model that automatically adapts to the distortion produced by wide-angle lenses. Our proposed image encoder architecture, dubbed DarSwin, leverages the physical characteristics of such lenses analytically defined by the radial distortion profile. In contrast to conventional transformer-based architectures, DarSwin comprises a radial patch partitioning, a distortion-based sampling technique for creating token embeddings, and an angular position encoding for radial patch merging. Compared to other baselines, DarSwin achieves the best results on different datasets with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. While the base DarSwin architecture requires knowledge of the radial distortion profile, we show it can be combined with a self-calibration network that estimates such a profile from the input image itself, resulting in a completely uncalibrated pipeline. Finally, we also present DarSwin-Unet, which extends DarSwin to an encoder-decoder architecture suitable for pixel-level tasks. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses. The code and models are publicly available at https://lvsn.github.io/darswin/

7/25/2024

DarSwin-Unet: Distortion Aware Encoder-Decoder Architecture

Akshaya Athwale, Ichrak Shili, Émile Bergeron, Ola Ahmad, Jean-François Lalonde

Wide-angle fisheye images are becoming increasingly common for perception tasks in applications such as robotics, security, and mobility (e.g. drones, avionics). However, current models often either ignore the distortions in wide-angle images or are not suitable to perform pixel-level tasks. In this paper, we present an encoder-decoder model based on a radial transformer architecture that adapts to distortions in wide-angle lenses by leveraging the physical characteristics defined by the radial distortion profile. In contrast to the original model, which only performs classification tasks, we introduce a U-Net architecture, DarSwin-Unet, designed for pixel-level tasks. Furthermore, we propose a novel strategy that minimizes sparsity when sampling the image for creating its input tokens. Our approach enhances the model's capability to handle pixel-level tasks in wide-angle fisheye images, making it more effective for real-world applications. Compared to other baselines, DarSwin-Unet achieves the best results across different datasets, with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses.

7/25/2024


HEAL-SWIN: A Vision Transformer On The Sphere

Oscar Carlsson, Jan E. Gerken, Hampus Linander, Heiner Spieß, Fredrik Ohlsson, Christoffer Petersson, Daniel Persson

High-resolution wide-angle fisheye images are becoming more and more important for robotics applications such as autonomous driving. However, using ordinary convolutional neural networks or vision transformers on this data is problematic due to projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution, distortion-free spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer, enabling the network to process spherical representations with minimal computational overhead. We demonstrate the superior performance of our model on both synthetic and real automotive datasets, as well as a selection of other image datasets, for semantic segmentation, depth regression and classification tasks. Our code is publicly available at https://github.com/JanEGerken/HEAL-SWIN.

5/9/2024

Spread Your Wings: A Radial Strip Transformer for Image Deblurring

Duosheng Chen, Shihao Zhou, Jinshan Pan, Jinglei Shi, Lishen Qu, Jufeng Yang

Exploring motion information is important for the motion deblurring task. Recently, window-based transformer approaches have achieved decent performance in image deblurring. Note that the motion causing blurry results is usually composed of translation and rotation movements, and the window-shift operation in the Cartesian coordinate system used by window-based transformer approaches only directly explores translation motion in orthogonal directions. Thus, these methods are limited in modeling the rotation part. To alleviate this problem, we introduce a polar coordinate-based transformer, which uses angles and distance to explore rotation motion and translation information together. In this paper, we propose a Radial Strip Transformer (RST), which is a transformer-based architecture that restores blurred images in a polar coordinate system instead of a Cartesian one. RST contains a dynamic radial embedding module (DRE) to extract shallow features with a radial deformable convolution. We design a polar mask layer to generate the offsets for the deformable convolution, which can reshape the convolution kernel along the radius to better capture the rotation motion information. Furthermore, we propose a radial strip attention solver (RSAS) for deep feature extraction, where the relationship of windows is organized by azimuth and radius. This attention module contains radial strip windows to reweight image features in polar coordinates, which preserves more useful information about rotation and translation motion together for better recovery of sharp images. Experimental results on six synthetic and real-world datasets show that our method performs favorably against other SOTA methods for the image deblurring task.

5/24/2024