Quantifying uncertainty in lung cancer segmentation with foundation models applied to mixed-domain datasets

Read original: arXiv:2403.13113 - Published 9/5/2024 by Aneesh Rangnekar, Nishant Nadkarni, Jue Jiang, Harini Veeraraghavan

Overview

This research paper examines the trustworthiness of pre-trained transformer models for lung cancer segmentation.
The study evaluates the performance and reliability of these models on lung cancer CT datasets, including data that differ from the training distribution.
The researchers aim to provide insights into the trustworthiness and limitations of using pre-trained transformers for this medical imaging task.

Plain English Explanation

Transformers are a type of machine learning model that has shown impressive performance on a variety of tasks, including medical image analysis. In this paper, the researchers investigate how reliable and trustworthy pre-trained transformer models are when used to segment lung cancer in medical scans.

Segmentation is the process of identifying and outlining specific regions or structures within an image. For lung cancer diagnosis and treatment, accurately segmenting the tumor is crucial. The researchers wanted to see how well pre-trained transformer models, which have been trained on large, general datasets, can perform this specialized medical imaging task.
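To make the idea of comparing a model's outline with an expert's outline concrete, the sketch below computes the Dice coefficient, a common overlap score for segmentation masks. It is a minimal illustration on a toy 2D array, not the paper's evaluation code; the function name dice_score and the toy masks are assumptions made for this example.

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice overlap between two binary masks of the same shape."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * intersection / total)

# Toy 2D "scan": 1 marks tumor pixels, 0 marks background (hypothetical data).
truth = np.zeros((6, 6), dtype=int)
truth[2:5, 2:5] = 1   # 3x3 tumor region drawn by the "expert"

pred = np.zeros((6, 6), dtype=int)
pred[2:5, 3:6] = 1    # same-sized region, shifted one column to the right

score = dice_score(pred, truth)
print(f"Dice score: {score:.3f}")
```

A score of 1 means the two masks coincide exactly and 0 means they do not overlap at all; the shifted toy prediction lands in between.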

They tested the transformer models on a dataset of lung cancer scans and compared their performance to other types of models. The goal was to understand the strengths and limitations of using these pre-trained transformers for lung cancer segmentation, and to provide guidance on how trustworthy the results from these models can be.

Technical Explanation

The researchers evaluated the performance of several pre-trained transformer models on a lung cancer segmentation task. They fine-tuned the pre-trained models on a dataset of CT scans with annotated lung cancer regions.

The models tested were Swin UNETR, SimMIM, iBOT, and SMIT, all foundation models pretrained with self-supervised learning (SSL). SimMIM, iBOT, and SMIT shared the same architecture, pretraining data, and fine-tuning data, so differences among them reflect the choice of SSL pretext task.

To assess the trustworthiness of the models, the researchers introduced a set of computationally fast metrics, reporting segmentation accuracy alongside tumor detection F1-score and the entropy of the predictions. They also evaluated robustness to distribution shift by testing on out-of-distribution (OOD) data: two public lung cancer datasets (LRAD, n = 140; 5Rater, n = 21) whose image acquisition and tumor stage differ from the training data (n = 317, stage III-IV lung cancers), and a public non-cancer dataset of CT scans from patients with pulmonary embolism (n = 120).

The results showed that all models produced similarly accurate tumor segmentations on the lung cancer test datasets. SMIT achieved the highest F1-score (LRAD: 0.60, 5Rater: 0.64) and the lowest entropy (LRAD: 0.06, 5Rater: 0.12), indicating a higher tumor detection rate and more confident segmentations. On the non-cancer OOD dataset, SMIT misdetected the fewest tumors, with a median volume occupancy of 5.67 cc compared with 9.97 cc for the second-best method, SimMIM.
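The quantities behind this kind of trustworthiness analysis can be sketched in a few lines of NumPy. The example below shows one plausible way to compute voxel-wise prediction entropy, a binned expected calibration error (the gap between confidence and accuracy), and the volume of a segmentation in cc. These are assumptions for illustration only: the function names, the binary (tumor versus background) setup, the entropy in bits, the 10-bin calibration scheme, and the toy data are not taken from the paper, whose exact metric definitions may differ.

```python
import numpy as np

def binary_entropy(prob: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Voxel-wise entropy (in bits) of a foreground probability map."""
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def expected_calibration_error(prob, truth, n_bins: int = 10) -> float:
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins."""
    prob = np.asarray(prob, dtype=float).ravel()
    truth = np.asarray(truth).ravel().astype(bool)
    pred = prob >= 0.5
    conf = np.where(pred, prob, 1.0 - prob)  # confidence in the predicted class
    correct = pred == truth
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1])  # bin index in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of voxels in the bin
    return float(ece)

def volume_cc(mask: np.ndarray, spacing_mm) -> float:
    """Volume of a binary mask in cubic centimetres (1 cc = 1000 mm^3)."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(mask.astype(bool).sum() * voxel_mm3 / 1000.0)

# Hypothetical model output: a random probability map with labels drawn to match it.
rng = np.random.default_rng(0)
prob_map = rng.uniform(0.0, 1.0, size=(8, 8, 8))
labels = rng.uniform(size=prob_map.shape) < prob_map

mean_entropy = float(binary_entropy(prob_map).mean())
ece = expected_calibration_error(prob_map, labels)
tumor_cc = volume_cc(prob_map >= 0.5, spacing_mm=(1.0, 1.0, 3.0))
print(f"mean entropy: {mean_entropy:.3f} bits, ECE: {ece:.3f}, volume: {tumor_cc:.2f} cc")
```

Entropy and segmented volume need no ground-truth labels, so they can be monitored after deployment; calibration error does need labels and is therefore limited to annotated test sets.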

Critical Analysis

The paper provides a useful assessment of the trustworthiness of pre-trained transformer models for the specialized task of lung cancer segmentation. While the models achieve strong performance, how far their confidence can be trusted and how they behave under distribution shift are potential limitations that need to be considered.

One area that could be explored further is the interpretability of the transformer models. Understanding how these complex models arrive at their predictions could help build trust and enable clinicians to better understand the model's decision-making process.

Additionally, the paper focuses on a single task, lung tumor segmentation, evaluated on a small number of public datasets. Expanding the evaluation to a wider range of lung cancer datasets and tasks, such as detection and diagnosis, could provide a more comprehensive assessment of the transformers' trustworthiness.

Conclusion

This research paper offers valuable insights into the trustworthiness of using pre-trained transformer models for the critical task of lung cancer segmentation. While the transformers demonstrate strong performance, the findings suggest that their reliability and calibration should be carefully considered when deploying these models in real-world medical settings. The study highlights the importance of thoroughly evaluating the trustworthiness of AI systems, especially in high-stakes domains like healthcare, to ensure they can be safely and reliably used.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Quantifying uncertainty in lung cancer segmentation with foundation models applied to mixed-domain datasets

Aneesh Rangnekar, Nishant Nadkarni, Jue Jiang, Harini Veeraraghavan

Medical image foundation models have shown the ability to segment organs and tumors with minimal fine-tuning. These models are typically evaluated on task-specific in-distribution (ID) datasets. However, reliable performance on ID datasets does not guarantee robust generalization on out-of-distribution (OOD) datasets. Importantly, once deployed for clinical use, it is impractical to have 'ground truth' delineations to assess ongoing performance drifts, especially when images fall into the OOD category due to different imaging protocols. Hence, we introduced a comprehensive set of computationally fast metrics to evaluate the performance of multiple foundation models (Swin UNETR, SimMIM, iBOT, SMIT) trained with self-supervised learning (SSL). SSL pretraining was selected as this approach is applicable for large, diverse, and unlabeled image sets. All models were fine-tuned on identical datasets for lung tumor segmentation from computed tomography (CT) scans. SimMIM, iBOT, and SMIT used an identical architecture and identical pretraining and fine-tuning datasets to assess performance variations with the choice of pretext tasks used in SSL. Evaluation was performed on two public lung cancer datasets (LRAD: n = 140, 5Rater: n = 21) with different image acquisitions and tumor stage compared to training data (n = 317 public resource with stage III-IV lung cancers) and a public non-cancer dataset containing volumetric CT scans of patients with pulmonary embolism (n = 120). All models produced similarly accurate tumor segmentation on the lung cancer testing datasets. SMIT produced the highest F1-score (LRAD: 0.60, 5Rater: 0.64) and lowest entropy (LRAD: 0.06, 5Rater: 0.12), indicating higher tumor detection rate and confident segmentations. In the OOD dataset, SMIT misdetected the least number of tumors, indicated by a median volume occupancy of 5.67 cc compared to 9.97 cc for the second-best method, SimMIM.

9/5/2024

An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation

Qin Li, Yizhe Zhang, Yan Li, Jun Lyu, Meng Liu, Longyu Sun, Mengting Sun, Qirong Li, Wenyue Mao, Xinran Wu, Yajing Zhang, Yinghua Chu, Shuo Wang, Chengyan Wang

The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potential for performance biases that could mirror those found in task-specific deep learning models like nnU-Net. In this paper, we explored the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of the organs including liver, kidney, spleen, lung and aorta from a total of 1056 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundational models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation.

6/19/2024

Transformer-based segmentation of adnexal lesions and ovarian implants in CT images

Aneesh Rangnekar, Kevin M. Boehm, Emily A. Aherne, Ines Nikolovski, Natalie Gangai, Ying Liu, Dimitry Zamarin, Kara L. Roche, Sohrab P. Shah, Yulia Lakhman, Harini Veeraraghavan

Two self-supervised pretrained transformer-based segmentation models (SMIT and Swin UNETR) fine-tuned on a dataset of ovarian cancer CT images provided reasonably accurate delineations of the tumors in an independent test dataset. Tumors in the adnexa were segmented more accurately by both transformers (SMIT and Swin UNETR) than the omental implants. AI-assisted labeling performed on 72 out of 245 omental implants resulted in smaller manual editing effort of 39.55 mm compared to full manual correction of partial labels of 106.49 mm and resulted in overall improved accuracy performance. Both SMIT and Swin UNETR did not generate any false detection of omental metastases in the urinary bladder and relatively few false detections in the small bowel, with 2.16 cc on average for SMIT and 7.37 cc for Swin UNETR respectively.

6/26/2024

AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets

Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Michael R. Harowicz, Kyle J. Lafata, Tina D. Tailor, Joseph Y. Lo

Lung cancer's high mortality rate can be mitigated by early detection, increasingly reliant on AI for diagnostic imaging. However, AI model performance depends on training and validation datasets. This study develops and validates AI models for both nodule detection and cancer classification tasks. For detection, two models (DLCSD-mD and LUNA16-mD) were developed using the Duke Lung Cancer Screening Dataset (DLCSD), with over 2,000 CT scans from 1,613 patients and more than 3,000 annotations. These models were evaluated on internal (DLCSD) and external datasets, including LUNA16 (601 patients, 1186 nodules) and NLST (969 patients, 1192 nodules), using FROC analysis and AUC metrics. For classification, five models were developed and tested: a randomly initialized 3D ResNet50, Genesis, MedNet3D, an enhanced ResNet50 using Strategic Warm-Start++ (SWS++), and a linear classifier analyzing features from the Foundation Model for Cancer Biomarkers (FMCB). These models were trained to distinguish between benign and malignant nodules and evaluated using AUC analysis on internal (DLCSD) and external datasets, including LUNA16 (433 patients, 677 nodules) and NLST. The DLCSD-mD model achieved an AUC of 0.93 (95% CI: 0.91-0.94) on the internal DLCSD dataset. External validation results were 0.97 (95% CI: 0.96-0.98) on LUNA16 and 0.75 (95% CI: 0.73-0.76) on NLST. For classification, the ResNet50-SWS++ model recorded AUCs of 0.71 (95% CI: 0.61-0.81) on DLCSD, 0.90 (95% CI: 0.87-0.93) on LUNA16, and 0.81 (95% CI: 0.79-0.82) on NLST. Other models showed varying performance across datasets, underscoring the importance of diverse model approaches. This benchmarking establishes DLCSD as a reliable resource for lung cancer AI research.

6/14/2024