Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero shot Medical Image Segmentation

Read original: arXiv:2404.06362 - Published 5/1/2024 by Sidra Aleem, Fangyijie Wang, Mayug Maniparambil, Eric Arazo, Julia Dietlmeier, Guenole Silvestre, Kathleen Curran, Noel E. O'Connor, Suzanne Little

Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero shot Medical Image Segmentation

Overview

Presents a novel approach for zero-shot medical image segmentation, called SaLIP, which cascades the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training (CLIP) model.
SaLIP leverages the capabilities of these models to enable zero-shot segmentation of medical images without any task-specific training.
Demonstrates the effectiveness of SaLIP on various medical image segmentation tasks, outperforming state-of-the-art methods.

Plain English Explanation

The research paper introduces a new technique called SaLIP (Segment Anything with Language-Image Pre-training) for performing medical image segmentation without any prior training on the specific task. This is known as "zero-shot" segmentation, meaning the model can segment images of medical structures without being explicitly trained on those structures.

SaLIP works by combining two powerful AI models: the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training (CLIP) model. The SAM model can segment any object in an image, while the CLIP model can understand the relationship between language and visual concepts.

By putting these two models together in a "cascade", SaLIP can take a medical image, understand the textual description of the anatomical structure to be segmented, and then precisely outline that structure in the image - all without any prior training on that specific medical task. This makes SaLIP a very flexible and powerful tool for medical image analysis.

The researchers demonstrate that SaLIP outperforms other state-of-the-art methods for zero-shot medical image segmentation, showing its effectiveness on a variety of medical imaging tasks such as segmenting eye features and analyzing medical videos.

Technical Explanation

The paper introduces a novel approach for zero-shot medical image segmentation called SaLIP, which cascades the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training (CLIP) model.

SaLIP leverages the capabilities of these two models to enable zero-shot segmentation of medical images without any task-specific training. The SAM model, which can segment any object in an image, is used to generate segmentation proposals. These proposals are then refined and filtered using the CLIP model, which can understand the relationship between language and visual concepts.

By combining SAM and CLIP in a cascaded architecture, SaLIP can take a medical image and a textual description of the target anatomical structure, and then precisely outline that structure in the image - without any prior training on that specific medical task.

The researchers evaluate SaLIP on various medical image segmentation tasks, including segmenting eye features, analyzing medical videos, and extending CLIP's image-text alignment capabilities. The results demonstrate that SaLIP outperforms state-of-the-art methods for zero-shot medical image segmentation, showcasing its effectiveness and flexibility.

Critical Analysis

The paper presents a promising approach for zero-shot medical image segmentation, but there are a few potential limitations and areas for further research:

The performance of SaLIP may be dependent on the quality and coverage of the language-image pre-training data used to train CLIP. Exploring the impact of different pre-training datasets could be an interesting direction.
While SaLIP demonstrates strong results on various medical image segmentation tasks, it would be valuable to assess its performance on a broader range of medical imaging modalities and anatomical structures.
The paper does not provide a detailed analysis of the failure cases or limitations of SaLIP. Understanding the types of medical images or anatomical structures that are challenging for the model could help guide future improvements.
The integration of SAM and CLIP is relatively straightforward in the current work. Investigating more advanced ways of combining these models, or exploring other pre-trained models, could potentially further enhance the zero-shot segmentation capabilities.

Overall, the SaLIP approach represents an exciting advancement in the field of zero-shot medical image segmentation, and the researchers have demonstrated its effectiveness through rigorous experimentation. Addressing the potential limitations and exploring further avenues of research could lead to even more impactful developments in this area.

Conclusion

The paper presents a novel approach called SaLIP for zero-shot medical image segmentation, which cascades the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training (CLIP) model. SaLIP leverages the capabilities of these two powerful AI models to enable the segmentation of medical images without any task-specific training.

The researchers demonstrate that SaLIP outperforms state-of-the-art methods for zero-shot medical image segmentation, showcasing its effectiveness on a variety of tasks such as segmenting eye features, analyzing medical videos, and extending CLIP's image-text alignment capabilities. This work represents an important step forward in the development of flexible and robust medical image analysis tools, with the potential to significantly impact various clinical applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero shot Medical Image Segmentation

Sidra Aleem, Fangyijie Wang, Mayug Maniparambil, Eric Arazo, Julia Dietlmeier, Guenole Silvestre, Kathleen Curran, Noel E. O'Connor, Suzanne Little

The Segment Anything Model (SAM) and CLIP are remarkable vision foundation models (VFMs). SAM, a prompt driven segmentation model, excels in segmentation tasks across diverse domains, while CLIP is renowned for its zero shot recognition capabilities. However, their unified potential has not yet been explored in medical image segmentation. To adapt SAM to medical imaging, existing methods primarily rely on tuning strategies that require extensive data or prior prompts tailored to the specific task, making it particularly challenging when only a limited number of data samples are available. This work presents an in depth exploration of integrating SAM and CLIP into a unified framework for medical image segmentation. Specifically, we propose a simple unified framework, SaLIP, for organ segmentation. Initially, SAM is used for part based segmentation within the image, followed by CLIP to retrieve the mask corresponding to the region of interest (ROI) from the pool of SAM generated masks. Finally, SAM is prompted by the retrieved ROI to segment a specific organ. Thus, SaLIP is training and fine tuning free and does not rely on domain expertise or labeled data for prompt engineering. Our method shows substantial enhancements in zero shot segmentation, showcasing notable improvements in DICE scores across diverse segmentation tasks like brain (63.46%), lung (50.11%), and fetal head (30.82%), when compared to un prompted SAM. Code and text prompts are available at: https://github.com/aleemsidra/SaLIP.

5/1/2024

MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactability. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models, such as CLIP and Segment-Anything-Model (SAM), with comprehensive cross-domain representation opened the door for interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation is still limited, but is highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM that combines CLIP and SAM models to generate segmentation of clinical scans using text prompts in both zero-shot and weakly supervised settings. To achieve this, we employed a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model and the recent gScoreCAM to generate prompts to obtain segmentation masks from SAM in a zero-shot setting. Additionally, we explored the use of zero-shot segmentation labels in a weakly supervised paradigm to improve the segmentation quality further. By extensively testing three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework has demonstrated excellent accuracy. Code is available at https://github.com/HealthX-Lab/MedCLIP-SAM.

6/21/2024

➖

Zero Shot Context-Based Object Segmentation using SLIP (SAM+CLIP)

Saaketh Koundinya Gundavarapu, Arushi Arora, Shreya Agarwal

We present SLIP (SAM+CLIP), an enhanced architecture for zero-shot object segmentation. SLIP combines the Segment Anything Model (SAM) cite{kirillov2023segment} with the Contrastive Language-Image Pretraining (CLIP) cite{radford2021learning}. By incorporating text prompts into SAM using CLIP, SLIP enables object segmentation without prior training on specific classes or categories. We fine-tune CLIP on a Pokemon dataset, allowing it to learn meaningful image-text representations. SLIP demonstrates the ability to recognize and segment objects in images based on contextual information from text prompts, expanding the capabilities of SAM for versatile object segmentation. Our experiments demonstrate the effectiveness of the SLIP architecture in segmenting objects in images based on textual cues. The integration of CLIP's text-image understanding capabilities into SAM expands the capabilities of the original architecture and enables more versatile and context-aware object segmentation.

5/27/2024

👀

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.

6/12/2024