One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts

Read original: arXiv:2312.17183 - Published 7/12/2024 by Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

Overview

This study focuses on building a model called "Segment Anything in medical scenarios, driven by Text prompts" (SAT) to segment medical images using text prompts.
The researchers made three key contributions:
1. Constructed a multi-modal knowledge tree on human anatomy with over 6,500 anatomical terms, and built what the authors describe as the largest and most comprehensive segmentation dataset for training, with over 22,000 3D medical image scans.
2. Developed a universal segmentation model that can be prompted with medical terminology in text form.
3. Trained SAT-Pro, a 447M-parameter model, on 72 segmentation datasets covering 497 classes, achieving performance comparable to 72 specialized nnU-Net models that together total about 2.2B parameters.

Plain English Explanation

The researchers wanted to create an AI model that could segment, or outline, different anatomical structures in medical images, just by describing what they wanted to see in plain text. This is a challenging task, as medical images can be complex, and there are thousands of different anatomical structures that need to be identified.

To tackle this problem, the researchers first built a comprehensive database of medical knowledge, including over 6,500 different anatomical terms. They also collected a huge dataset of over 22,000 medical scans, carefully standardizing the images and labels.

Next, they developed a universal segmentation model that could be prompted with text descriptions to outline the relevant anatomical structures. This model was trained on the large dataset they had compiled.

Finally, the researchers trained a version of this model, called SAT-Pro, with only 447 million parameters. That is small next to the combined 2.2 billion parameters of 72 specialized models, each trained on an individual dataset. Yet SAT-Pro achieved performance comparable to these specialized models across 72 segmentation datasets and 497 classes of anatomical structures.
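The size comparison can be checked with quick arithmetic. The sketch below uses only the figures quoted above (447M, 2.2B, 72 models). The per-model average and the ratio are derived here for illustration and are not numbers reported in the summary; the individual nnU-Nets need not all be the same size.

```python
# Figures quoted in the summary above.
sat_pro_params = 447e6            # SAT-Pro
specialist_total_params = 2.2e9   # 72 nnU-Nets combined (approximate)
num_specialists = 72

# Derived values (illustrative only; not reported in the source).
avg_specialist_params = specialist_total_params / num_specialists
size_ratio = specialist_total_params / sat_pro_params

print(f"Average specialist size: {avg_specialist_params / 1e6:.1f}M parameters")
print(f"Specialists combined are {size_ratio:.1f}x the size of SAT-Pro")
```

Under these figures, the average specialist comes to roughly 30.6M parameters, and the 72 specialists together are about 4.9 times the size of SAT-Pro.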

The key innovation here is the ability to use simple text prompts to guide a single AI model, rather than requiring a separate specialized model for each dataset. This could make medical image analysis much more accessible and scalable.

Technical Explanation

The researchers began by constructing a multi-modal knowledge tree on human anatomy, including 6,502 anatomical terms. They then built the largest and most comprehensive segmentation dataset for training, collecting over 22,000 3D medical image scans from 72 segmentation datasets and carefully standardizing the image scans and label space.
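Standardizing the label space means that the same structure, labeled differently across source datasets, ends up under one shared class. A minimal sketch of that idea follows. The dataset names, label IDs, and synonym table are all hypothetical, and this is not the authors' actual pipeline, which the summary does not describe in detail.

```python
# Hypothetical per-dataset label maps: each dataset uses its own integer IDs
# and its own spelling for the same anatomical structure.
DATASET_LABELS = {
    "dataset_a": {1: "Liver", 2: "kidney_left"},
    "dataset_b": {1: "left kidney", 2: "liver", 3: "Spleen"},
}

# Hypothetical synonym table mapping normalized raw names to one canonical term.
CANONICAL = {
    "liver": "liver",
    "kidney_left": "left kidney",
    "left kidney": "left kidney",
    "spleen": "spleen",
}


def canonical_name(raw_name: str) -> str:
    """Normalize case and whitespace, then look up the canonical term."""
    return CANONICAL[raw_name.strip().lower()]


def build_unified_label_space(dataset_labels):
    """Return the sorted unified class list and a per-dataset ID remapping."""
    classes = sorted(
        {canonical_name(name)
         for labels in dataset_labels.values()
         for name in labels.values()}
    )
    index = {name: i for i, name in enumerate(classes)}
    remap = {
        dataset: {local_id: index[canonical_name(name)]
                  for local_id, name in labels.items()}
        for dataset, labels in dataset_labels.items()
    }
    return classes, remap


classes, remap = build_unified_label_space(DATASET_LABELS)
print(classes)  # unified class names
print(remap)    # local label ID -> unified class index, per dataset
```

Here both toy datasets end up sharing three unified classes, and each dataset's local integer IDs are remapped to indices in that shared list.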

For the architecture design, the team first injected medical knowledge into a text encoder via contrastive learning, so that the embeddings of anatomical terms carry domain knowledge. They then formulated a universal segmentation model that takes medical terminology in text form as its prompt and trained it on the combined dataset.
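To make "prompted with text" concrete, here is a toy sketch of one common way a text prompt can select voxels: compare the prompt's embedding with each voxel's feature vector and threshold the result. This decoding scheme, the function names, and the toy numbers are assumptions for illustration; the summary does not specify how SAT turns a prompt into a mask. In a real system the embeddings would come from the text encoder and the voxel features from a 3D image network.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def segment_with_prompt(voxel_features, prompt_embedding, threshold=0.5):
    """Return a binary mask: 1 where a voxel's feature matches the prompt."""
    return [
        1 if sigmoid(dot(feature, prompt_embedding)) > threshold else 0
        for feature in voxel_features
    ]


# Toy inputs (hypothetical): 4 voxels with 2-D features and two text prompts.
voxel_features = [[2.0, -1.0], [1.5, -0.5], [-1.0, 2.0], [-2.0, 1.0]]
prompt_embeddings = {
    "liver": [1.0, -1.0],
    "spleen": [-1.0, 1.0],
}

masks = {
    name: segment_with_prompt(voxel_features, embedding)
    for name, embedding in prompt_embeddings.items()
}
print(masks)
```

The same voxel features yield a different mask for each prompt, which is what lets one model cover many classes: adding a class means adding a prompt, not an output head.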

The researchers then trained two models on the 72 segmentation datasets, covering 497 classes: SAT-Nano (110M parameters) and SAT-Pro (447M parameters). They evaluated performance from three perspectives: averaged by body region, by class, and by dataset. SAT-Pro achieved performance comparable to 72 specialized nnU-Net models, each trained on a single dataset, which together total around 2.2B parameters.
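The three evaluation perspectives differ only in how per-class scores are grouped before averaging. The sketch below shows one plausible reading: compute a Dice score per (dataset, class) pair, average within each group, then average the group means. The grouping rule, the field names, and every score value are assumptions made up for illustration; the paper's exact aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical per-(dataset, class) results; values invented for illustration.
results = [
    {"dataset": "ds_a", "cls": "liver", "region": "abdomen", "dice": 0.9},
    {"dataset": "ds_a", "cls": "spleen", "region": "abdomen", "dice": 0.8},
    {"dataset": "ds_b", "cls": "liver", "region": "abdomen", "dice": 0.7},
    {"dataset": "ds_b", "cls": "brainstem", "region": "head", "dice": 0.6},
]


def average_by(results, key):
    """Average Dice within each group, then average the group means."""
    groups = defaultdict(list)
    for row in results:
        groups[row[key]].append(row["dice"])
    group_means = {group: mean(scores) for group, scores in groups.items()}
    return mean(list(group_means.values())), group_means


by_dataset, _ = average_by(results, "dataset")
by_class, _ = average_by(results, "cls")
by_region, _ = average_by(results, "region")
print(by_dataset, by_class, by_region)
```

The three averages differ even on the same four scores, because each grouping weights the rows differently. That is why reporting all three gives a fuller picture than a single number.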

Critical Analysis

The research presented in this paper is highly impressive, demonstrating significant advancements in the field of medical image segmentation. The ability to use a single, relatively compact model to segment a wide range of anatomical structures across numerous datasets is a remarkable achievement.

However, the paper does mention a few caveats and limitations. For example, the researchers note that the performance of the SAT-Pro model, while comparable to the specialized nnU-Net models, is not uniformly superior across all datasets and classes. There may be room for further improvement, especially on more challenging or niche anatomical structures.

Additionally, the dataset construction process, while comprehensive, may still be subject to biases or inconsistencies inherent in the source data. The researchers acknowledge the need for continued refinement and expansion of the dataset to ensure robust and unbiased model performance.

Future research could explore strategies to further enhance the model's capabilities, such as multi-rater prompting for ambiguous medical image segmentation or incorporating additional modalities of medical data beyond just image scans.

Overall, this research represents a significant step forward in the development of universal and scalable medical image analysis tools, with the potential to greatly improve the efficiency and accessibility of medical diagnostics and treatment planning.

Conclusion

This study presents a groundbreaking approach to medical image segmentation, with the development of a versatile AI model called SAT (Segment Anything in medical scenarios, driven by Text prompts). The researchers made several key contributions, including the construction of a comprehensive anatomical knowledge base, the creation of a large-scale segmentation dataset, and the design of a universal segmentation model that can be prompted with text descriptions.

The resulting SAT-Pro model, with just 447 million parameters, was able to achieve performance comparable to 72 specialized nnU-Net models with a combined 2.2 billion parameters. This remarkable feat demonstrates the power of the researchers' approach and its potential to revolutionize the field of medical image analysis.

By enabling segmentation through simple text prompts, the SAT model could make medical image processing much more accessible and scalable, empowering clinicians and researchers to quickly and easily extract relevant anatomical information from complex medical scans. As the researchers continue to refine and expand this technology, it could have far-reaching implications for improving healthcare outcomes and advancing medical research.

Related Papers

One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts

Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

In this study, we aim to build up a model that can Segment Anything in radiology scans, driven by Text prompts, termed as SAT. Our main contributions are three folds: (i) for dataset construction, we construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies; Then we build up the largest and most comprehensive segmentation dataset for training, by collecting over 22K 3D medical image scans from 72 segmentation datasets, across 497 classes, with careful standardization on both image scans and label space; (ii) for architecture design, we propose to inject medical knowledge into a text encoder via contrastive learning, and then formulate a universal segmentation model, that can be prompted by feeding in medical terminologies in text form; (iii) As a result, we have trained SAT-Nano (110M parameters) and SAT-Pro (447M parameters), demonstrating comparable performance to 72 specialist nnU-Nets trained on each dataset/subsets. We validate SAT as a foundational segmentation model, with better generalization ability on external (unseen) datasets, and can be further improved on specific tasks after fine-tuning adaptation. Comparing with interactive segmentation model, for example, MedSAM, segmentation model prompted by text enables superior performance, scalability and robustness. As a use case, we demonstrate that SAT can act as a powerful out-of-the-box agent for large language models, enabling visual grounding in clinical procedures such as report generation. All the data, codes, and models in this work have been released.

7/12/2024

One-Prompt to Segment All Medical Images

Junde Wu, Jiayuan Zhu, Yuanpei Liu, Yueming Jin, Min Xu

Large foundation models, known for their strong zero-shot generalization, have excelled in visual and language applications. However, applying them to medical image segmentation, a domain with diverse imaging types and target labels, remains an open challenge. Current approaches, such as adapting interactive segmentation models like Segment Anything Model (SAM), require user prompts for each sample during inference. Alternatively, transfer learning methods like few/one-shot models demand labeled samples, leading to high costs. This paper introduces a new paradigm toward the universal medical image segmentation, termed 'One-Prompt Segmentation.' One-Prompt Segmentation combines the strengths of one-shot and interactive methods. In the inference stage, with just one prompted sample, it can adeptly handle the unseen task in a single forward pass. We train One-Prompt Model on 64 open-source medical datasets, accompanied by the collection of over 3,000 clinician-labeled prompts. Tested on 14 previously unseen datasets, the One-Prompt Model showcases superior zero-shot segmentation capabilities, outperforming a wide range of related methods. The code and data are released at https://github.com/KidsWithTokens/one-prompt.

4/12/2024

CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

Zhongzhen Huang, Yankai Jiang, Rongzhao Zhang, Shaoting Zhang, Xiaofan Zhang

Existing promptable segmentation methods in the medical imaging field primarily consider either textual or visual prompts to segment relevant objects, yet they often fall short when addressing anomalies in medical images, like tumors, which may vary greatly in shape, size, and appearance. Recognizing the complexity of medical scenarios and the limitations of textual or visual prompts, we propose a novel dual-prompt schema that leverages the complementary strengths of visual and textual prompts for segmenting various organs and tumors. Specifically, we introduce CAT, an innovative model that Coordinates Anatomical prompts derived from 3D cropped images with Textual prompts enriched by medical domain knowledge. The model architecture adopts a general query-based design, where prompt queries facilitate segmentation queries for mask prediction. To synergize two types of prompts within a unified framework, we implement a ShareRefiner, which refines both segmentation and prompt queries while disentangling the two types of prompts. Trained on a consortium of 10 public CT datasets, CAT demonstrates superior performance in multiple segmentation tasks. Further validation on a specialized in-house dataset reveals the remarkable capacity of segmenting tumors across multiple cancer stages. This approach confirms that coordinating multimodal prompts is a promising avenue for addressing complex scenarios in the medical domain.

6/12/2024

Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography

Jie Liu, Yixiao Zhang, Kang Wang, Mehmet Can Yavuz, Xiaoxi Chen, Yixuan Yuan, Haoliang Li, Yang Yang, Alan Yuille, Yucheng Tang, Zongwei Zhou

The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and learning scheme. To overcome these limitations, we propose a universal, extensible framework enabling a single model, termed Universal Model, to deal with multiple public datasets and adapt to new classes (e.g., organs/tumors). Firstly, we introduce a novel language-driven parameter generator that leverages language embeddings from large language models, enriching semantic encoding compared with one-hot encoding. Secondly, the conventional output layers are replaced with lightweight, class-specific heads, allowing Universal Model to simultaneously segment 25 organs and six types of tumors and ease the addition of new classes. We train our Universal Model on 3,410 CT volumes assembled from 14 publicly available datasets and then test it on 6,173 CT volumes from four external datasets. Universal Model achieves first place on six CT tasks in the Medical Segmentation Decathlon (MSD) public leaderboard and leading performance on the Beyond The Cranial Vault (BTCV) dataset. In summary, Universal Model exhibits remarkable computational efficiency (6x faster than other dataset-specific models), demonstrates strong generalization across different hospitals, transfers well to numerous downstream tasks, and more importantly, facilitates the extensibility to new classes while alleviating the catastrophic forgetting of previously learned classes. Codes, models, and datasets are available at https://github.com/ljwztc/CLIP-Driven-Universal-Model

5/29/2024