WiTUnet: A U-Shaped Architecture Integrating CNN and Transformer for Improved Feature Alignment and Local Information Fusion

2404.09533

Published 4/30/2024 by Bin Wang, Fei Deng, Peifan Jiang, Shuang Wang, Xiao Han, Zhixuan Zhang

✨

Abstract

Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging, given its lower radiation dose compared to standard CT, despite increasing image noise and potentially affecting diagnostic accuracy. To address this, advanced deep learning-based LDCT denoising algorithms have been developed, primarily using Convolutional Neural Networks (CNNs) or Transformer Networks with the Unet architecture. This architecture enhances image detail by integrating feature maps from the encoder and decoder via skip connections. However, current methods often overlook enhancements to the Unet architecture itself, focusing instead on optimizing encoder and decoder structures. This approach can be problematic due to the significant differences in feature map characteristics between the encoder and decoder, where simple fusion strategies may not effectively reconstruct images.In this paper, we introduce WiTUnet, a novel LDCT image denoising method that utilizes nested, dense skip pathways instead of traditional skip connections to improve feature integration. WiTUnet also incorporates a windowed Transformer structure to process images in smaller, non-overlapping segments, reducing computational load. Additionally, the integration of a Local Image Perception Enhancement (LiPe) module in both the encoder and decoder replaces the standard multi-layer perceptron (MLP) in Transformers, enhancing local feature capture and representation. Through extensive experimental comparisons, WiTUnet has demonstrated superior performance over existing methods in key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE), significantly improving noise removal and image quality.

Create account to get full access

Overview

Low-dose computed tomography (LDCT) is a medical imaging technique that uses less radiation compared to standard CT scans
However, LDCT can result in increased image noise, potentially affecting diagnostic accuracy
Researchers have developed advanced deep learning-based algorithms to denoise LDCT images, primarily using Convolutional Neural Networks (CNNs) or Transformer Networks with the Unet architecture
The Unet architecture enhances image detail by integrating feature maps from the encoder and decoder via skip connections
Current methods often focus on optimizing the encoder and decoder structures, but may not effectively address the differences in feature map characteristics between the two

Plain English Explanation

LDCT scans are becoming more common in medical imaging because they use less radiation than standard CT scans. However, the trade-off is that LDCT images can be noisier, which could make it harder for doctors to accurately diagnose patients. To address this, researchers have developed advanced deep learning-based algorithms to clean up the noise in LDCT images.

These algorithms often use a type of neural network architecture called Unet, which helps enhance the details in the images by combining information from different parts of the network. However, current methods tend to focus on optimizing the individual components of the Unet architecture, rather than addressing the fundamental differences in the types of features captured by the encoder (the part that analyzes the image) and the decoder (the part that reconstructs the image).

Technical Explanation

In this paper, the researchers introduce a new method called WiTUnet, which aims to improve upon the traditional Unet architecture for LDCT denoising. The key innovations in WiTUnet include:

Nested, Dense Skip Pathways: Instead of using standard skip connections to integrate features between the encoder and decoder, WiTUnet employs a more sophisticated approach called "nested, dense skip pathways." This helps to better bridge the gap between the different types of features captured by the encoder and decoder.
Windowed Transformer Structure: WiTUnet incorporates a "windowed Transformer" structure, which processes the image in smaller, non-overlapping segments. This reduces the computational load compared to processing the entire image at once.
Local Image Perception Enhancement (LiPe) Module: The researchers replace the standard multi-layer perceptron (MLP) used in Transformers with the LiPe module, which enhances the capture and representation of local features in the image.

Through extensive experiments, the researchers demonstrate that WiTUnet outperforms existing LDCT denoising methods in key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE). This suggests that their innovations, particularly the nested, dense skip pathways and the LiPe module, are effective in improving noise removal and image quality.

Critical Analysis

The researchers acknowledge that their method, while effective, may still have some limitations. For example, they note that the windowed Transformer structure, while reducing computational load, may not fully capture global image features. Additionally, the LiPe module, while enhancing local feature representation, may not be as effective in capturing long-range dependencies in the image.

Further research could explore ways to balance the trade-offs between local and global feature extraction, potentially by incorporating hybrid approaches that combine the strengths of different neural network architectures. Additionally, the researchers could investigate lightweight or efficient variants of their WiTUnet model to further improve its practical applicability in clinical settings.

Conclusion

The paper introduces a novel LDCT denoising method called WiTUnet, which utilizes several innovative architectural components to enhance image quality and reduce computational load. The key contributions include the nested, dense skip pathways, the windowed Transformer structure, and the Local Image Perception Enhancement (LiPe) module. Experimental results demonstrate that WiTUnet outperforms existing LDCT denoising algorithms, suggesting that these architectural improvements can significantly improve the clinical utility of LDCT imaging.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🌐

LUCF-Net: Lightweight U-shaped Cascade Fusion Network for Medical Image Segmentation

Songkai Sun, Qingshan She, Yuliang Ma, Rihui Li, Yingchun Zhang

In this study, the performance of existing U-shaped neural network architectures was enhanced for medical image segmentation by adding Transformer. Although Transformer architectures are powerful at extracting global information, its ability to capture local information is limited due to its high complexity. To address this challenge, we proposed a new lightweight U-shaped cascade fusion network (LUCF-Net) for medical image segmentation. It utilized an asymmetrical structural design and incorporated both local and global modules to enhance its capacity for local and global modeling. Additionally, a multi-layer cascade fusion decoding network was designed to further bolster the network's information fusion capabilities. Validation results achieved on multi-organ datasets in CT format, cardiac segmentation datasets in MRI format, and dermatology datasets in image format demonstrated that the proposed model outperformed other state-of-the-art methods in handling local-global information, achieving an improvement of 1.54% in Dice coefficient and 2.6 mm in Hausdorff distance on multi-organ segmentation. Furthermore, as a network that combines Convolutional Neural Network and Transformer architectures, it achieves competitive segmentation performance with only 6.93 million parameters and 6.6 gigabytes of floating point operations, without the need of pre-training. In summary, the proposed method demonstrated enhanced performance while retaining a simpler model design compared to other Transformer-based segmentation networks.

4/12/2024

eess.IV cs.CV cs.LG

Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis

Vu Minh Hieu Phan, Yutong Xie, Bowen Zhang, Yuankai Qi, Zhibin Liao, Antonios Perperidis, Son Lam Phung, Johan W. Verjans, Minh-Son To

Unpaired medical image synthesis aims to provide complementary information for an accurate clinical diagnostics, and address challenges in obtaining aligned multi-modal medical scans. Transformer-based models excel in imaging translation tasks thanks to their ability to capture long-range dependencies. Although effective in supervised training settings, their performance falters in unpaired image synthesis, particularly in synthesizing structural details. This paper empirically demonstrates that, lacking strong inductive biases, Transformer can converge to non-optimal solutions in the absence of paired data. To address this, we introduce UNet Structured Transformer (UNest), a novel architecture incorporating structural inductive biases for unpaired medical image synthesis. We leverage the foundational Segment-Anything Model to precisely extract the foreground structure and perform structural attention within the main anatomy. This guides the model to learn key anatomical regions, thus improving structural synthesis under the lack of supervision in unpaired training. Evaluated on two public datasets, spanning three modalities, i.e., MR, CT, and PET, UNest improves recent methods by up to 19.30% across six medical image synthesis tasks. Our code is released at https://github.com/HieuPhan33/MICCAI2024-UNest.

6/28/2024

cs.CV

🌐

GCtx-UNet: Efficient Network for Medical Image Segmentation

Khaled Alrfou, Tian Zhao

Medical image segmentation is crucial for disease diagnosis and monitoring. Though effective, the current segmentation networks such as UNet struggle with capturing long-range features. More accurate models such as TransUNet, Swin-UNet, and CS-UNet have higher computation complexity. To address this problem, we propose GCtx-UNet, a lightweight segmentation architecture that can capture global and local image features with accuracy better or comparable to the state-of-the-art approaches. GCtx-UNet uses vision transformer that leverages global context self-attention modules joined with local self-attention to model long and short range spatial dependencies. GCtx-UNet is evaluated on the Synapse multi-organ abdominal CT dataset, the ACDC cardiac MRI dataset, and several polyp segmentation datasets. In terms of Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) metrics, GCtx-UNet outperformed CNN-based and Transformer-based approaches, with notable gains in the segmentation of complex and small anatomical structures. Moreover, GCtx-UNet is much more efficient than the state-of-the-art approaches with smaller model size, lower computation workload, and faster training and inference speed, making it a practical choice for clinical applications.

6/11/2024

eess.IV cs.CV cs.LG

UnWave-Net: Unrolled Wavelet Network for Compton Tomography Image Reconstruction

Ishak Ayad, C'ecilia Tarpau, Javier Cebeiro, Mai K. Nguyen

Computed tomography (CT) is a widely used medical imaging technique to scan internal structures of a body, typically involving collimation and mechanical rotation. Compton scatter tomography (CST) presents an interesting alternative to conventional CT by leveraging Compton physics instead of collimation to gather information from multiple directions. While CST introduces new imaging opportunities with several advantages such as high sensitivity, compactness, and entirely fixed systems, image reconstruction remains an open problem due to the mathematical challenges of CST modeling. In contrast, deep unrolling networks have demonstrated potential in CT image reconstruction, despite their computationally intensive nature. In this study, we investigate the efficiency of unrolling networks for CST image reconstruction. To address the important computational cost required for training, we propose UnWave-Net, a novel unrolled wavelet-based reconstruction network. This architecture includes a non-local regularization term based on wavelets, which captures long-range dependencies within images and emphasizes the multi-scale components of the wavelet transform. We evaluate our approach using a CST of circular geometry which stays completely static during data acquisition, where UnWave-Net facilitates image reconstruction in the absence of a specific reconstruction formula. Our method outperforms existing approaches and achieves state-of-the-art performance in terms of SSIM and PSNR, and offers an improved computational efficiency compared to traditional unrolling networks.

6/6/2024

eess.IV cs.CV