A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR

Read original: arXiv:2406.08046 - Published 6/13/2024 by Sasidhar Alavala, Anil Kumar Vadde, Aparnamala Kancheti, Subrahmanyam Gorthi

Overview

This paper presents a robust pipeline for classifying and detecting bleeding frames in wireless capsule endoscopy (WCE) videos using the Swin Transformer and RT-DETR (Real-Time Detection Transformer) models.
The pipeline aims to improve the accuracy and efficiency of detecting gastrointestinal bleeding in WCE, which is a critical task for early diagnosis and treatment.
The authors utilize the Swin Transformer for frame-level classification of bleeding vs. non-bleeding frames, and RT-DETR for object-level detection of bleeding regions within the frames.

Plain English Explanation

The paper focuses on developing a system to automatically identify and locate bleeding in video footage from wireless capsule endoscopies. Wireless capsule endoscopy is a procedure where a patient swallows a small camera inside a pill, which then takes images of the digestive tract as it passes through. Being able to quickly and accurately detect signs of bleeding in these videos is crucial for doctors to diagnose and treat gastrointestinal issues early.

The researchers used two advanced machine learning models to build their pipeline. The Swin Transformer is used to classify each frame of the video as either showing bleeding or not. Then, the RT-DETR model is used to draw bounding boxes around the specific areas within each frame that contain bleeding. By combining these two models, the system can both identify that there is bleeding present and also precisely locate where it is occurring.

The key innovation here is bringing together two recent transformer-based vision models in a single pipeline for this medical application. On the challenge's validation data, the authors report that their classifier scores highest among the models they compared and that their detector edges out a YOLOv8 baseline, making the approach a promising aid for doctors reviewing wireless capsule endoscopy footage.

Technical Explanation

The authors' pipeline first uses the Swin Transformer for frame-level classification of bleeding vs. non-bleeding frames. The Swin Transformer is a hierarchical vision transformer that computes self-attention within local windows and shifts those windows between layers, so information can flow across window boundaries while the attention cost stays manageable. This lets it model spatial relationships within the endoscopy images efficiently.
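To make the shifted-window idea concrete, here is a minimal sketch in plain Python. It is not the authors' code or the Swin reference implementation: the function name `window_ids`, the 8x8 grid, and the window size of 4 are all illustrative assumptions, and the attention masking that Swin applies to wrapped-around regions is left out. It only shows which cells share a window before and after a cyclic shift of half a window.

```python
from typing import List


def window_ids(height: int, width: int, window: int, shift: int = 0) -> List[List[int]]:
    """Return a grid whose cells hold the index of the window they fall in.

    With shift > 0 the grid is cyclically shifted before partitioning, so
    cells that sat on opposite sides of a window border can share a window.
    (Hypothetical helper for illustration only.)
    """
    windows_per_row = width // window
    grid = []
    for r in range(height):
        row = []
        for c in range(width):
            rr = (r - shift) % height  # position after the cyclic shift
            cc = (c - shift) % width
            row.append((rr // window) * windows_per_row + cc // window)
        grid.append(row)
    return grid


regular = window_ids(8, 8, window=4)           # plain partition
shifted = window_ids(8, 8, window=4, shift=2)  # shifted by half a window

# Cells (3, 3) and (4, 4) are diagonal neighbours on either side of a border.
print(regular[3][3], regular[4][4])  # different windows
print(shifted[3][3], shifted[4][4])  # same window after the shift
```

Alternating the two partitions across layers is what lets neighbouring windows exchange information without computing attention over the whole image.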

For the object-level detection of bleeding regions, the pipeline employs the RT-DETR model. RT-DETR is a real-time detection transformer that, according to the abstract, pairs an efficient hybrid encoder for multi-scale features with uncertainty-minimal query selection. It predicts bounding boxes around bleeding areas within each frame, so the system can not only identify that bleeding is present but also pinpoint where.
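The two-stage flow described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: `run_pipeline`, `FrameResult`, the 0.5 threshold, and the toy stand-in models are all hypothetical, and running the detector only on frames the classifier flags is an assumption about how the stages are chained. In practice `classify` and `detect` would wrap a trained Swin Transformer and RT-DETR model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class FrameResult:
    index: int            # position of the frame in the video
    bleeding_prob: float  # classifier score for the "bleeding" class
    boxes: List[Box]      # detected bleeding regions (empty if not flagged)


def run_pipeline(
    frames: Sequence[Any],
    classify: Callable[[Any], float],
    detect: Callable[[Any], List[Box]],
    threshold: float = 0.5,  # assumed decision threshold
) -> List[FrameResult]:
    """Classify every frame, then run detection only on flagged frames."""
    results = []
    for index, frame in enumerate(frames):
        prob = classify(frame)
        boxes = detect(frame) if prob >= threshold else []
        results.append(FrameResult(index, prob, boxes))
    return results


# Toy stand-ins: each "frame" is just a number acting as a bleeding score.
def toy_classify(frame: float) -> float:
    return frame


def toy_detect(frame: float) -> List[Box]:
    return [(0.0, 0.0, 10.0, 10.0)]


results = run_pipeline([0.1, 0.9], toy_classify, toy_detect)
for result in results:
    print(result)
```

Passing the models in as callables keeps the control flow independent of any particular framework, which makes it easy to swap either stage.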

The authors evaluate their pipeline on data from the Auto WCEBleedGen Challenge V2 2024, annotated for the presence and location of bleeding. On the validation set they report a classification accuracy of 98.5%, compared to 91.7% without preprocessing, and an AP50 of 66.7%, compared to 65.0% for YOLOv8. On the test set, the classification accuracy and F1 score are 87.0% and 89.0% respectively. RT-DETR is designed for real-time detection, which matters for clinical deployment, although the abstract does not report inference speeds.
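The paper's abstract reports classification accuracy, F1 score, and AP50. As a rough guide to what these measure, the sketch below computes accuracy and F1 for binary frame labels, plus the intersection-over-union (IoU) test that decides whether a predicted box counts as a match at the 0.5 threshold behind AP50. The function names and toy inputs are illustrative assumptions. This is not the challenge's evaluation code, and it stops short of computing average precision itself.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy and F1 for binary labels (1 = bleeding, 0 = non-bleeding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy data: four frames, one bleeding frame missed by the classifier.
acc, f1 = accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0])

# A predicted box counts as a hit for AP50 only if its IoU is at least 0.5.
overlap = iou((0, 0, 10, 10), (0, 0, 10, 8))
is_match_at_50 = overlap >= 0.5
print(acc, f1, overlap, is_match_at_50)
```

AP50 then summarises the detector's precision-recall curve when matches are defined by this 0.5 IoU criterion.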

Critical Analysis

The paper presents a well-designed and thorough study, with a clear focus on developing a robust and efficient system for a critical medical application. The authors' choice of the Swin Transformer and RT-DETR models appears well-justified, as these are cutting-edge techniques that have shown strong results in related computer vision tasks.

However, the paper does not delve into some potential limitations or areas for further improvement. For example, it would be interesting to understand how the pipeline might perform on more challenging or diverse WCE datasets, or how it compares to human expert performance on this task. Additionally, the authors do not discuss potential biases or failure modes of the models, which is an important consideration for real-world clinical deployment.

That said, the core contribution of the paper - demonstrating the effectiveness of combining advanced vision models for this application - is significant. The authors have made a valuable step forward in automating the analysis of wireless capsule endoscopy footage, which could have important implications for improving the speed and accuracy of gastrointestinal disease diagnosis and treatment.

Conclusion

This paper presents a novel pipeline that leverages the Swin Transformer and RT-DETR models to provide robust and efficient classification and detection of bleeding in wireless capsule endoscopy videos. By combining these state-of-the-art computer vision techniques, the authors have created a system that can accurately identify the presence of bleeding and precisely locate the affected areas, making it a valuable tool for early diagnosis and treatment of gastrointestinal issues. While the paper could delve further into potential limitations, it represents an important advancement in automating the analysis of this critical medical imaging data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR

Sasidhar Alavala, Anil Kumar Vadde, Aparnamala Kancheti, Subrahmanyam Gorthi

In this paper, we present our approach to the Auto WCEBleedGen Challenge V2 2024. Our solution combines the Swin Transformer for the initial classification of bleeding frames and RT-DETR for further detection of bleeding in Wireless Capsule Endoscopy (WCE), enhanced by a series of image preprocessing steps. These steps include converting images to Lab colour space, applying Contrast Limited Adaptive Histogram Equalization (CLAHE) for better contrast, and using Gaussian blur to suppress artefacts. The Swin Transformer utilizes a tiered architecture with shifted windows to efficiently manage self-attention calculations, focusing on local windows while enabling cross-window interactions. RT-DETR features an efficient hybrid encoder for fast processing of multi-scale features and an uncertainty-minimal query selection for enhanced accuracy. The class activation maps by Ablation-CAM are plausible to the model's decisions. On the validation set, this approach achieves a classification accuracy of 98.5% (best among the other state-of-the-art models) compared to 91.7% without any pre-processing and an $\text{AP}_{50}$ of 66.7% compared to 65.0% with state-of-the-art YOLOv8. On the test set, this approach achieves a classification accuracy and F1 score of 87.0% and 89.0% respectively.

6/13/2024

WCEbleedGen: A wireless capsule endoscopy dataset and its benchmarking for automatic bleeding classification, detection, and segmentation

Palak Handa, Manas Dhir, Amirreza Mahbod, Florian Schwarzhans, Ramona Woitek, Nidhi Goel, Deepak Gunjan

Computer-based analysis of Wireless Capsule Endoscopy (WCE) is crucial. However, a medically annotated WCE dataset for training and evaluation of automatic classification, detection, and segmentation of bleeding and non-bleeding frames is currently lacking. The present work focused on development of a medically annotated WCE dataset called WCEbleedGen for automatic classification, detection, and segmentation of bleeding and non-bleeding frames. It comprises 2,618 WCE bleeding and non-bleeding frames which were collected from various internet resources and existing WCE datasets. A comprehensive benchmarking and evaluation of the developed dataset was done using nine classification-based, three detection-based, and three segmentation-based deep learning models. The dataset is of high-quality, is class-balanced and contains single and multiple bleeding sites. Overall, our standard benchmark results show that Visual Geometric Group (VGG) 19, You Only Look Once version 8 nano (YOLOv8n), and Link network (Linknet) performed best in automatic classification, detection, and segmentation-based evaluations, respectively. Automatic bleeding diagnosis is crucial for WCE video interpretations. This diverse dataset will aid in developing of real-time, multi-task learning-based innovative solutions for automatic bleeding diagnosis in WCE. The dataset and code are publicly available at https://zenodo.org/records/10156571 and https://github.com/misahub2023/Benchmarking-Codes-of-the-WCEBleedGen-dataset.

8/23/2024

Classification of Endoscopy and Video Capsule Images using CNN-Transformer Model

Aliza Subedi, Smriti Regmi, Nisha Regmi, Bhumi Bhusal, Ulas Bagci, Debesh Jha

Gastrointestinal cancer is a leading cause of cancer-related incidence and death, making it crucial to develop novel computer-aided diagnosis systems for early detection and enhanced treatment. Traditional approaches rely on the expertise of gastroenterologists to identify diseases; however, this process is subjective, and interpretation can vary even among expert clinicians. Considering recent advancements in classifying gastrointestinal anomalies and landmarks in endoscopic and video capsule endoscopy images, this study proposes a hybrid model that combines the advantages of Transformers and Convolutional Neural Networks (CNNs) to enhance classification performance. Our model utilizes DenseNet201 as a CNN branch to extract local features and integrates a Swin Transformer branch for global feature understanding, combining both to perform the classification task. For the GastroVision dataset, our proposed model demonstrates excellent performance with Precision, Recall, F1 score, Accuracy, and Matthews Correlation Coefficient (MCC) of 0.8320, 0.8386, 0.8324, 0.8386, and 0.8191, respectively, showcasing its robustness against class imbalance and surpassing other CNNs as well as the Swin Transformer model. Similarly, for the Kvasir-Capsule, a large video capsule endoscopy dataset, our model outperforms all others, achieving overall Precision, Recall, F1 score, Accuracy, and MCC of 0.7007, 0.7239, 0.6900, 0.7239, and 0.3871. Moreover, we generated saliency maps to explain our model's focus areas, demonstrating its reliable decision-making process. The results underscore the potential of our hybrid CNN-Transformer model in aiding the early and accurate detection of gastrointestinal (GI) anomalies.

8/21/2024

S-E Pipeline: A Vision Transformer (ViT) based Resilient Classification Pipeline for Medical Imaging Against Adversarial Attacks

Neha A S, Vivek Chaturvedi, Muhammad Shafique

Vision Transformer (ViT) is becoming widely popular in automating accurate disease diagnosis in medical imaging owing to its robust self-attention mechanism. However, ViTs remain vulnerable to adversarial attacks that may thwart the diagnosis process by leading it to intentional misclassification of critical disease. In this paper, we propose a novel image classification pipeline, namely, S-E Pipeline, that performs multiple pre-processing steps that allow ViT to be trained on critical features so as to reduce the impact of input perturbations by adversaries. Our method uses a combination of segmentation and image enhancement techniques such as Contrast Limited Adaptive Histogram Equalization (CLAHE), Unsharp Masking (UM), and High-Frequency Emphasis filtering (HFE) as preprocessing steps to identify critical features that remain intact even after adversarial perturbations. The experimental study demonstrates that our novel pipeline helps in reducing the effect of adversarial attacks by 72.22% for the ViT-b32 model and 86.58% for the ViT-l32 model. Furthermore, we have shown an end-to-end deployment of our proposed method on the NVIDIA Jetson Orin Nano board to demonstrate its practical use case in modern hand-held devices that are usually resource-constrained.

7/26/2024