DarSwin-Unet: Distortion Aware Encoder-Decoder Architecture

Read original: arXiv:2407.17328 - Published 7/25/2024 by Akshaya Athwale, Ichrak Shili, Émile Bergeron, Ola Ahmad, Jean-François Lalonde

Overview

DarSwin-Unet is a novel encoder-decoder architecture designed to address distortion in computer vision tasks.
It combines a Swin Transformer-based encoder with a distortion-aware decoder to handle various types of distortions.
The model demonstrates improved performance on tasks like semantic segmentation and image deconvolution compared to baseline models.

Plain English Explanation

In computer vision, many real-world images and videos can be affected by distortions, such as lens aberrations, atmospheric turbulence, or camera shake. These distortions can degrade the quality and accuracy of computer vision models, making it challenging to perform tasks like object detection, image segmentation, or image enhancement.

The DarSwin-Unet model is designed to address this problem. It combines two key components:

Swin Transformer-based Encoder: The encoder part of the model uses a Swin Transformer architecture, which has been shown to be effective for a variety of computer vision tasks. The Swin Transformer is able to capture long-range dependencies in the input data, which can be helpful for dealing with distortions.
Distortion-aware Decoder: The decoder part of the model is designed to be "distortion-aware", meaning it can explicitly model and compensate for different types of distortions, such as radial or affine distortions. This helps the model produce more accurate and distortion-free output, even when the input is affected by various types of distortions.

By combining these two components, the DarSwin-Unet model is able to achieve improved performance on tasks like semantic segmentation and image deconvolution, where dealing with distortions is crucial for accurate results.

Technical Explanation

The DarSwin-Unet model consists of two main components:

Swin Transformer-based Encoder: The encoder uses a Swin Transformer, a hierarchical vision transformer. It computes self-attention inside local windows and shifts those windows between consecutive layers, so information can propagate across window boundaries and, over several layers, capture long-range dependencies in the input.
Distortion-aware Decoder: The decoder part of the model is designed to be "distortion-aware", meaning it can explicitly model and compensate for different types of distortions, such as radial or affine distortions. The decoder uses a series of distortion-aware blocks, which incorporate distortion parameters (e.g., radial distortion coefficients) into the feature maps to enable distortion-aware feature extraction and reconstruction.
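The shifted-window idea mentioned for the encoder can be illustrated with a small sketch. This is not the paper's code: it uses a plain 4x4 grid of token ids instead of feature tensors, the function names `cyclic_shift` and `window_partition` are hypothetical, and it omits the attention mask that a full Swin implementation applies to windows that wrap around the border.

```python
from typing import List

Grid = List[List[int]]


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll the grid up and to the left by `shift` cells, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
        for r in range(h)
    ]


def window_partition(grid: Grid, window: int) -> List[Grid]:
    """Split an H x W grid into non-overlapping window x window tiles."""
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be a multiple of the window size")
    tiles = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            tiles.append([row[left:left + window] for row in grid[top:top + window]])
    return tiles


# A 4x4 grid of token ids 0..15, split into 2x2 windows.
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular layer: attention would be computed inside each of these tiles.
regular = window_partition(tokens, 2)

# Shifted layer: shift by half a window, then partition again.
shifted = window_partition(cyclic_shift(tokens, 1), 2)

print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

In the shifted layer, the first window groups tokens 5, 6, 9 and 10, which belonged to four different windows in the regular layer. Alternating the two layouts is what lets information cross window boundaries.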

The DarSwin-Unet model is evaluated on two computer vision tasks: semantic segmentation and image deconvolution. For semantic segmentation, the model demonstrates improved performance on the Cityscapes dataset compared to baseline models like U-Net and Swin-Unet. For image deconvolution, the model shows better results on a dataset of images affected by atmospheric turbulence.

Critical Analysis

The DarSwin-Unet paper presents a novel and promising approach to addressing distortion in computer vision tasks. The combination of a Swin Transformer-based encoder and a distortion-aware decoder is a unique and well-designed solution to this problem.

One potential limitation of the research is that it focuses on specific types of distortions, such as radial and affine distortions. It would be interesting to see if the model can be generalized to handle a wider range of distortions, including more complex or non-linear types of distortions.
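To make the distinction between radial and affine distortions concrete, here is a minimal sketch. It assumes the common polynomial radial model, in which a point at radius r from the image center is scaled by 1 + k1 r^2 + k2 r^4. The source does not say which lens model the paper uses, so the coefficients `k1`, `k2` and all function names here are illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def radial_distort(p: Point, k1: float, k2: float = 0.0) -> Point:
    """Polynomial radial distortion about the origin (assumed model)."""
    x, y = p
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return (x * scale, y * scale)


def affine(p: Point, a: float, b: float, c: float, d: float,
           tx: float, ty: float) -> Point:
    """Affine map: a linear 2x2 transform followed by a translation."""
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)


def is_collinear(pts: List[Point], tol: float = 1e-9) -> bool:
    """True if three points lie on one straight line (zero cross product)."""
    (x1, y1), (x2, y2), (x3, y3) = pts
    cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
    return abs(cross) < tol


# Three points on the horizontal line y = 0.5, in normalized coordinates.
line = [(-1.0, 0.5), (0.0, 0.5), (1.0, 0.5)]

# Barrel-type distortion (negative k1) bends the line.
bent = [radial_distort(p, k1=-0.2) for p in line]

# An affine map moves the line but keeps it straight.
moved = [affine(p, 1.2, 0.3, -0.1, 0.9, 0.5, -0.2) for p in line]

print(is_collinear(bent))   # False
print(is_collinear(moved))  # True
```

The radial map is non-linear in the point coordinates, so straight lines away from the center become curves. An affine map always sends straight lines to straight lines.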

Additionally, the paper does not provide a detailed analysis of the model's computational complexity or inference speed, which could be an important consideration for real-world applications. Further research could investigate the trade-offs between the model's performance and its computational efficiency.

Overall, the DarSwin-Unet paper presents a compelling approach to addressing distortion in computer vision and opens up interesting avenues for future research in this area.

Conclusion

The DarSwin-Unet model is a novel encoder-decoder architecture that combines a Swin Transformer-based encoder with a distortion-aware decoder to address the challenge of handling distortions in computer vision tasks. By explicitly modeling and compensating for different types of distortions, the model demonstrates improved performance on tasks like semantic segmentation and image deconvolution compared to baseline models.

This research highlights the importance of addressing distortions in computer vision and suggests that a combined approach of powerful feature extraction (via the Swin Transformer) and distortion-aware processing (via the distortion-aware decoder) can be an effective way to improve model performance in real-world scenarios. As computer vision systems become increasingly integrated into various applications, addressing distortions will be crucial for ensuring accurate and reliable results.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DarSwin-Unet: Distortion Aware Encoder-Decoder Architecture

Akshaya Athwale, Ichrak Shili, Émile Bergeron, Ola Ahmad, Jean-François Lalonde

Wide-angle fisheye images are becoming increasingly common for perception tasks in applications such as robotics, security, and mobility (e.g. drones, avionics). However, current models often either ignore the distortions in wide-angle images or are not suitable to perform pixel-level tasks. In this paper, we present an encoder-decoder model based on a radial transformer architecture that adapts to distortions in wide-angle lenses by leveraging the physical characteristics defined by the radial distortion profile. In contrast to the original model, which only performs classification tasks, we introduce a U-Net architecture, DarSwin-Unet, designed for pixel level tasks. Furthermore, we propose a novel strategy that minimizes sparsity when sampling the image for creating its input tokens. Our approach enhances the model capability to handle pixel-level tasks in wide-angle fisheye images, making it more effective for real-world applications. Compared to other baselines, DarSwin-Unet achieves the best results across different datasets, with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses.

7/25/2024

DarSwin: Distortion Aware Radial Swin Transformer

Akshaya Athwale, Arman Afrasiyabi, Justin Lague, Ichrak Shili, Ola Ahmad, Jean-François Lalonde

Wide-angle lenses are commonly used in perception tasks requiring a large field of view. Unfortunately, these lenses produce significant distortions, making conventional models that ignore the distortion effects unable to adapt to wide-angle images. In this paper, we present a novel transformer-based model that automatically adapts to the distortion produced by wide-angle lenses. Our proposed image encoder architecture, dubbed DarSwin, leverages the physical characteristics of such lenses analytically defined by the radial distortion profile. In contrast to conventional transformer-based architectures, DarSwin comprises a radial patch partitioning, a distortion-based sampling technique for creating token embeddings, and an angular position encoding for radial patch merging. Compared to other baselines, DarSwin achieves the best results on different datasets with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. While the base DarSwin architecture requires knowledge of the radial distortion profile, we show it can be combined with a self-calibration network that estimates such a profile from the input image itself, resulting in a completely uncalibrated pipeline. Finally, we also present DarSwin-Unet, which extends DarSwin, to an encoder-decoder architecture suitable for pixel-level tasks. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses. The code and models are publicly available at https://lvsn.github.io/darswin/

7/25/2024

Ground-based Image Deconvolution with Swin Transformer UNet

Utsav Akhaury, Pascale Jablonka, Jean-Luc Starck, Frédéric Courbin

As ground-based all-sky astronomical surveys will gather millions of images in the coming years, a critical requirement emerges for the development of fast deconvolution algorithms capable of efficiently improving the spatial resolution of these images. By successfully recovering clean and high-resolution images from these surveys, the objective is to deepen the understanding of galaxy formation and evolution through accurate photometric measurements. We introduce a two-step deconvolution framework using a Swin Transformer architecture. Our study reveals that the deep learning-based solution introduces a bias, constraining the scope of scientific analysis. To address this limitation, we propose a novel third step relying on the active coefficients in the sparsity wavelet framework. We conducted a performance comparison between our deep learning-based method and Firedec, a classical deconvolution algorithm, based on an analysis of a subset of the EDisCS cluster samples. We demonstrate the advantage of our method in terms of resolution recovery, generalisation to different noise properties, and computational efficiency. The analysis of this cluster sample not only allowed us to assess the efficiency of our method, but it also enabled us to quantify the number of clumps within these galaxies in relation to their disc colour. This robust technique that we propose holds promise for identifying structures in the distant universe through ground-based images.

6/5/2024

WiTUnet: A U-Shaped Architecture Integrating CNN and Transformer for Improved Feature Alignment and Local Information Fusion

Bin Wang, Fei Deng, Peifan Jiang, Shuang Wang, Xiao Han, Zhixuan Zhang

Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging, given its lower radiation dose compared to standard CT, despite increasing image noise and potentially affecting diagnostic accuracy. To address this, advanced deep learning-based LDCT denoising algorithms have been developed, primarily using Convolutional Neural Networks (CNNs) or Transformer Networks with the Unet architecture. This architecture enhances image detail by integrating feature maps from the encoder and decoder via skip connections. However, current methods often overlook enhancements to the Unet architecture itself, focusing instead on optimizing encoder and decoder structures. This approach can be problematic due to the significant differences in feature map characteristics between the encoder and decoder, where simple fusion strategies may not effectively reconstruct images. In this paper, we introduce WiTUnet, a novel LDCT image denoising method that utilizes nested, dense skip pathways instead of traditional skip connections to improve feature integration. WiTUnet also incorporates a windowed Transformer structure to process images in smaller, non-overlapping segments, reducing computational load. Additionally, the integration of a Local Image Perception Enhancement (LiPe) module in both the encoder and decoder replaces the standard multi-layer perceptron (MLP) in Transformers, enhancing local feature capture and representation. Through extensive experimental comparisons, WiTUnet has demonstrated superior performance over existing methods in key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE), significantly improving noise removal and image quality.

4/30/2024