Surgical-DeSAM: Decoupling SAM for Instrument Segmentation in Robotic Surgery

Read original: arXiv:2404.14040 - Published 4/23/2024 by Yuyang Sheng, Sophia Bano, Matthew J. Clarkson, Mobarakol Islam

Overview

This paper presents Surgical-DeSAM, a method that adapts the Segment Anything Model (SAM) to segment surgical instruments in robotic surgery without manual prompts.
SAM normally needs a point, text, or bounding box prompt for every image, which is impractical frame-by-frame during real-time surgery; Surgical-DeSAM generates bounding box prompts automatically with a fine-tuned DETR detector.
According to the abstract, the method reaches Dice scores of 89.62 and 90.70 on the EndoVis 2017 and 2018 datasets and outperforms other state-of-the-art instrument segmentation methods.

Plain English Explanation

The Segment Anything Model (SAM) is a powerful AI model that can identify and outline objects in images. However, it relies on a prompt, such as a click or a box drawn around the object, to know what to outline. In robotic surgery video, nobody can supply such a prompt for every frame in real time.

The researchers in this paper developed a modified version of SAM, called Surgical-DeSAM, that is better suited for this medical application. They "decoupled" SAM by swapping its image encoder for the encoder of an object detector, which finds each instrument and hands its bounding box to SAM's mask-generating components as an automatic prompt.

By making these changes, Surgical-DeSAM outperformed other state-of-the-art instrument segmentation methods on two public benchmarks of surgical video (EndoVis 2017 and 2018). This means it could more precisely outline the surgical tools, which could be useful for applications like automating surgical workflow analysis or providing better visual guidance during procedures.

The key innovation here is adapting a powerful AI model like SAM to work more effectively in a specialized medical context, overcoming some of its limitations. This demonstrates the potential for further customizing and optimizing general-purpose AI systems to tackle specific real-world challenges.

Technical Explanation

The authors propose Surgical-DeSAM, a modified version of the Segment Anything Model (SAM) tailored for the task of surgical instrument segmentation in robotic surgery.

The standard SAM architecture consists of a Vision Transformer (ViT) image encoder, a prompt encoder, and a mask decoder, and it expects a point, text, or bounding box prompt for each image. The authors give three reasons why such prompting is impractical in safety-critical surgical tasks:

Per-frame prompts are not available for supervised learning.
Prompting frame-by-frame is unrealistic in a real-time tracking application.
Annotating prompts is expensive for offline applications.

To remove the need for manual prompts, Surgical-DeSAM couples an object detector with a decoupled SAM (DeSAM). Specifically:

A commonly used detection architecture, DETR, is fine-tuned to produce a bounding box prompt for each instrument, with a Swin-transformer adopted for better feature representation.
SAM's image encoder is replaced with the DETR encoder, and the prompt encoder and mask decoder are fine-tuned to produce instance segmentation masks for the instruments.

The authors validate Surgical-DeSAM on two publicly available datasets from the MICCAI EndoVis 2017 and 2018 surgical instrument segmentation challenges, reporting Dice scores of 89.62 and 90.70 respectively. They report that it outperforms other state-of-the-art instrument segmentation methods and runs in real time without any additional prompting.
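
The data flow described in the paper's abstract (reproduced under Related Papers) can be sketched in a few lines: image features are computed once, a detector turns them into one bounding box per instrument, and each box is passed as a prompt to a mask decoder that reuses the same features. This is only an illustrative sketch. Every name here (`Box`, `toy_encoder`, `toy_detector`, `toy_mask_decoder`, `segment_instruments`) is a hypothetical stand-in with toy logic, not the authors' code, DETR, or the SAM API.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image: rows of pixel values
Mask = List[List[int]]    # binary mask with the same shape as the image


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with inclusive pixel coordinates (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


def toy_encoder(image: Image) -> Image:
    # Stand-in for the shared image encoder: threshold pixels into a feature map.
    return [[1 if px > 0 else 0 for px in row] for row in image]


def toy_detector(features: Image) -> List[Box]:
    # Stand-in for the detection head: one box around all active features.
    ys = [y for y, row in enumerate(features) if any(row)]
    xs = [x for row in features for x, v in enumerate(row) if v]
    if not ys:
        return []  # nothing detected, so no prompts are produced
    return [Box(min(xs), min(ys), max(xs), max(ys))]


def toy_mask_decoder(features: Image, box: Box) -> Mask:
    # Stand-in for prompt encoder + mask decoder: keep features inside the box.
    return [
        [v if box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1 else 0
         for x, v in enumerate(row)]
        for y, row in enumerate(features)
    ]


def segment_instruments(
    image: Image,
    encoder: Callable[[Image], Image],
    detector: Callable[[Image], List[Box]],
    mask_decoder: Callable[[Image, Box], Mask],
) -> List[Mask]:
    features = encoder(image)    # computed once, reused by both stages
    boxes = detector(features)   # automatic box prompts, no human input
    return [mask_decoder(features, box) for box in boxes]


image = [
    [0, 0, 0, 0],
    [0, 5, 7, 0],
    [0, 0, 9, 0],
]
masks = segment_instruments(image, toy_encoder, toy_detector, toy_mask_decoder)
print(masks)
```

The point of the structure is that `segment_instruments` never asks the caller for a prompt: the detector's output takes the place of the clicks or boxes a human would otherwise supply for each frame.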

Critical Analysis

The authors have made a thoughtful effort to adapt the general-purpose SAM model to the specific challenges of surgical instrument segmentation. By decoupling the feature extraction and segmentation components, they have been able to improve the model's performance on this specialized task.

However, some potential limitations and areas for further research are worth considering:

The evaluation was conducted on two datasets from the same challenge series (EndoVis 2017 and 2018), so the generalizability of Surgical-DeSAM to other surgical settings or instrument types is still uncertain. More extensive testing on diverse datasets would help validate the model's broader applicability.
The authors do not provide a detailed analysis of the computational and memory requirements of Surgical-DeSAM compared to the original SAM. This information would be helpful to assess the tradeoffs between performance improvements and increased model complexity.
The segmentation stage depends on the bounding boxes produced by the detector, so further research is needed to understand how missed or misplaced detections affect the final masks.
The paper does not discuss the potential for test-time adaptation strategies to further improve the model's performance on individual patients or surgical procedures.

Overall, the Surgical-DeSAM approach demonstrates the value of customizing general-purpose AI models to better suit specific application domains. The authors have taken a thoughtful step in this direction, but continued research and validation will be important to realize the full potential of this approach in real-world surgical settings.

Conclusion

This paper introduces Surgical-DeSAM, a modified version of the Segment Anything Model (SAM) that is better suited for the task of surgical instrument segmentation in robotic surgery. By decoupling the feature extraction and segmentation components of the SAM architecture, the authors have been able to improve both the accuracy and efficiency of the model on this specialized medical application.

The key innovation of Surgical-DeSAM is that it removes SAM's dependence on manual prompts by generating bounding box prompts automatically, which makes real-time use during surgery feasible. This demonstrates the value of adapting general-purpose AI models to tackle specific real-world challenges, rather than relying on a one-size-fits-all approach.

While further research is needed to fully validate the generalizability and practical implications of Surgical-DeSAM, this work represents an important step forward in the development of advanced computer vision techniques for robotic surgery. As AI systems become more widely adopted in medical settings, this type of specialized model customization will likely play a crucial role in unlocking their full potential to enhance surgical outcomes and patient care.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Surgical-DeSAM: Decoupling SAM for Instrument Segmentation in Robotic Surgery

Yuyang Sheng, Sophia Bano, Matthew J. Clarkson, Mobarakol Islam

Purpose: The recent Segment Anything Model (SAM) has demonstrated impressive performance with point, text or bounding box prompts in various applications. However, in safety-critical surgical tasks, prompting is not possible because (i) per-frame prompts are lacking for supervised learning, (ii) it is unrealistic to prompt frame-by-frame in a real-time tracking application, and (iii) it is expensive to annotate prompts for offline applications. Methods: We develop Surgical-DeSAM to generate automatic bounding box prompts for decoupling SAM to obtain instrument segmentation in real-time robotic surgery. We utilise a commonly used detection architecture, DETR, and fine-tune it to obtain bounding box prompts for the instruments. We then employ decoupling SAM (DeSAM) by replacing the image encoder with the DETR encoder and fine-tuning the prompt encoder and mask decoder to obtain instance segmentation for the surgical instruments. To improve detection performance, we adopt the Swin-transformer for better feature representation. Results: The proposed method has been validated on two publicly available datasets from the MICCAI surgical instruments segmentation challenge, EndoVis 2017 and 2018. The performance of our method is also compared with SOTA instrument segmentation methods and demonstrates significant improvements, with Dice metrics of 89.62 and 90.70 for EndoVis 2017 and 2018, respectively. Conclusion: Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.

4/23/2024
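
The Dice metric quoted in the abstract above measures the overlap between a predicted mask P and a ground-truth mask T as 2|P ∩ T| / (|P| + |T|). The sketch below computes it for a single pair of binary masks in plain Python. The function name `dice_score` and the empty-mask convention are assumptions made for illustration, and the paper may aggregate the score differently (for example per class or per frame) before reporting it on what appears to be a 0 to 100 scale.

```python
from typing import List

Mask = List[List[int]]  # binary mask: rows of 0/1 values


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks of equal shape."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    # Count pixels that are foreground in both masks.
    intersection = sum(
        p & t
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_score(pred, target)
print(f"{100 * score:.2f}")  # 2 shared pixels, 3 + 3 foreground pixels -> 66.67
```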

DeSAM: Decoupled Segment Anything Model for Generalizable Medical Image Segmentation

Yifan Gao, Wei Xia, Dingdu Hu, Wenkui Wang, Xin Gao

Deep learning-based medical image segmentation models often suffer from domain shift, where the models trained on a source domain do not generalize well to other unseen domains. As a prompt-driven foundation model with powerful generalization capabilities, the Segment Anything Model (SAM) shows potential for improving the cross-domain robustness of medical image segmentation. However, SAM performs significantly worse in automatic segmentation scenarios than when manually prompted, hindering its direct application to domain generalization. Upon further investigation, we discovered that the degradation in performance was related to the coupling effect of inevitable poor prompts and mask generation. To address the coupling effect, we propose the Decoupled SAM (DeSAM). DeSAM modifies SAM's mask decoder by introducing two new modules: a prompt-relevant IoU module (PRIM) and a prompt-decoupled mask module (PDMM). PRIM predicts the IoU score and generates mask embeddings, while PDMM extracts multi-scale features from the intermediate layers of the image encoder and fuses them with the mask embeddings from PRIM to generate the final segmentation mask. This decoupled design allows DeSAM to leverage the pre-trained weights while minimizing the performance degradation caused by poor prompts. We conducted experiments on publicly available cross-site prostate and cross-modality abdominal image segmentation datasets. The results show that our DeSAM leads to a substantial performance improvement over previous state-of-the-art domain generalization methods. The code is publicly available at https://github.com/yifangao112/DeSAM.

7/10/2024

SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation

Jieming Yu, An Wang, Wenzhen Dong, Mengya Xu, Mobarakol Islam, Jie Wang, Long Bai, Hongliang Ren

The recent Segment Anything Model (SAM) 2 has demonstrated remarkable foundational competence in semantic segmentation, with its memory mechanism and mask decoder further addressing challenges in video tracking and object occlusion, thereby achieving superior results in interactive segmentation for both images and videos. Building upon our previous empirical studies, we further explore the zero-shot segmentation performance of SAM 2 in robot-assisted surgery based on prompts, alongside its robustness against real-world corruption. For static images, we employ two forms of prompts: 1-point and bounding box, while for video sequences, the 1-point prompt is applied to the initial frame. Through extensive experimentation on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 2, when utilizing bounding box prompts, outperforms state-of-the-art (SOTA) methods in comparative evaluations. The results with point prompts also exhibit a substantial enhancement over SAM's capabilities, nearing or even surpassing existing unprompted SOTA methodologies. In addition, SAM 2 demonstrates improved inference speed and less performance degradation under various image corruptions. Although slightly unsatisfactory results remain in specific edges or regions, SAM 2's robust adaptability to 1-point prompts underscores its potential for downstream surgical tasks with limited prompt requirements.

8/9/2024

Adapting SAM for Surgical Instrument Tracking and Segmentation in Endoscopic Submucosal Dissection Videos

Jieming Yu, Long Bai, Guankun Wang, An Wang, Xiaoxiao Yang, Huxin Gao, Hongliang Ren

The precise tracking and segmentation of surgical instruments have led to a remarkable enhancement in the efficiency of surgical procedures. However, the challenge lies in achieving accurate segmentation of surgical instruments while minimizing the need for manual annotation and reducing the time required for the segmentation process. To tackle this, we propose a novel framework for surgical instrument segmentation and tracking. Specifically, with a tiny subset of frames for segmentation, we ensure accurate segmentation across the entire surgical video. Our method adopts a two-stage approach to efficiently segment videos. Initially, we utilize the Segment-Anything (SAM) model, which has been fine-tuned using Low-Rank Adaptation (LoRA) on the EndoVis17 dataset. The fine-tuned SAM model is applied to segment the initial frames of the video accurately. Subsequently, we deploy the XMem++ tracking algorithm to follow the annotated frames, thereby facilitating the segmentation of the entire video sequence. This workflow enables us to precisely segment and track objects within the video. Through extensive evaluation on the in-distribution dataset (EndoVis17) and the out-of-distribution datasets (EndoVis18 and the endoscopic submucosal dissection surgery (ESD) dataset), our framework demonstrates exceptional accuracy and robustness, thus showcasing its potential to advance automated robot-assisted surgery.

8/9/2024
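
The abstract above names Low-Rank Adaptation (LoRA) as its fine-tuning method. As a reminder of the idea, LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A, where B has shape (d, r), A has shape (r, k), and r is small. The sketch below shows that arithmetic in plain Python on a toy 3x3 matrix. The helper names and numbers are illustrative assumptions and say nothing about how that paper configures LoRA inside SAM.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain-Python matrix product: rows of a against columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A; W stays frozen, only B and A are trained."""
    rank = len(a)           # A has shape (r, k), B has shape (d, r)
    delta = matmul(b, a)    # low-rank update with the same shape as W
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]       # frozen 3x3 pretrained weight (toy values)
b = [[1.0], [0.0], [2.0]]   # 3x1 trainable factor
a = [[0.5, 0.0, 1.0]]       # 1x3 trainable factor, so rank r = 1
w_eff = lora_effective_weight(w, b, a, alpha=2.0)
print(w_eff)

# Trainable values: 3 + 3 = 6 in B and A, versus 9 in the full matrix.
trainable = sum(len(row) for row in b) + sum(len(row) for row in a)
print(trainable)
```

Even in this toy case the update needs fewer trainable values than the full matrix, and the saving grows quickly for the large weight matrices in a model such as SAM.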