RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis

Read original: arXiv:2404.16754 - Published 4/26/2024 by Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Jiayu Lei, Ya Zhang, Yanfeng Wang, Weidi Xie

👨‍🏫

Overview

Researchers in AI for Medicine (AI4Medicine) have been focusing on developing generalist foundation models, which rely heavily on large, diverse datasets.
This paper introduces a new comprehensive dataset called RadGenome-Chest CT, which is designed to enable the development of advanced multimodal medical foundation models.
The dataset includes:
- Organ-level segmentation masks for 197 categories in chest CT scans
- 665K multi-granularity grounded reports, where each sentence is linked to a corresponding anatomical region
- 1.3M grounded visual question-answering (VQA) pairs, with questions and answers linked to reference segmentation masks

Plain English Explanation

Researchers working in the field of AI for Medicine have been very interested in creating "generalist" AI models that can handle a wide range of medical tasks. A key insight is that these models need to be trained on large, diverse datasets that cover different medical imaging modalities and provide rich supervision signals.

The paper introduces a new dataset called RadGenome-Chest CT that is designed to help develop these advanced AI models for medical image understanding. The dataset includes 3D chest CT scans from over 20,000 patients, along with a wealth of additional information:

Detailed segmentation masks that outline 197 different anatomical structures in the CT scans. This provides the models with a visual "roadmap" of the body.
Over 665,000 medical reports that are linked to the specific regions of the CT scans they describe. This allows the models to learn how to associate textual descriptions with visual evidence.
1.3 million question-and-answer pairs that are also connected to the segmentation masks, enabling the models to practice reasoning about the visual information.

By training on this rich, multimodal dataset, the researchers hope to develop AI models that can truly understand and reason about medical images in a more holistic and human-like way.

Technical Explanation

The paper introduces a new large-scale dataset called RadGenome-Chest CT that is designed to facilitate the development of advanced multimodal medical foundation models. The dataset is built upon the existing CT-RATE dataset, which contains over 25,692 non-contrast 3D chest CT volumes and corresponding clinical reports from 20,000 patients.

The key innovations of the RadGenome-Chest CT dataset are:

Organ-level Segmentation Masks: The researchers leveraged powerful universal segmentation models to generate detailed segmentation masks covering 197 anatomical categories in the chest CT scans. These segmentation maps provide rich visual clues to help models better interpret the medical images.
Multi-granularity Grounded Reports: The dataset includes 665,000 medical reports, where each sentence is linked to the corresponding anatomical region in the CT volume via the segmentation masks. This enables models to learn how to associate textual descriptions with specific visual features.
Grounded Visual Question Answering: The researchers also curated 1.3 million question-answer pairs that are connected to the segmentation masks, allowing models to practice reasoning about the visual evidence and explaining their answers.

The validation set of the dataset has been manually verified to ensure high quality. The researchers believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by providing rich, grounded training data that can teach models to generate relevant text based on visual information.

Critical Analysis

The RadGenome-Chest CT dataset represents a significant advancement in the creation of large, diverse medical imaging datasets to support the development of generalist AI models for healthcare. By incorporating organ-level segmentation masks, grounded textual reports, and VQA pairs, the dataset provides a wealth of multimodal supervision signals that go beyond previous efforts.

However, the paper does not address some potential limitations and areas for further research:

Dataset Representativeness: While the dataset includes a large number of patients, it is unclear how representative the data is of the broader population. Potential biases in the patient demographics or imaging characteristics could limit the model's generalization.
Report and VQA Quality: The paper states that the validation set has been manually verified, but the quality assurance process for the full dataset is not described. Errors or inconsistencies in the textual annotations could negatively impact model training and performance.
Clinical Relevance: The paper focuses on the technical aspects of the dataset, but does not discuss how the resulting models would be evaluated or deployed in real-world clinical settings. Careful consideration of clinical workflows and end-user needs is crucial for developing AI systems that are truly useful for healthcare professionals.
Privacy and Ethics: The paper does not address the important issues of patient privacy and ethical use of medical data. Robust data anonymization and responsible AI development practices will be essential for deploying these models in clinical practice.

Overall, the RadGenome-Chest CT dataset represents an important step forward in enabling the development of more advanced, multimodal AI models for medical imaging. However, further research is needed to address the dataset's limitations and ensure the resulting models are clinically relevant, safe, and ethical.

Conclusion

This paper introduces a comprehensive new dataset called RadGenome-Chest CT that is designed to facilitate the development of generalist foundation models for medical image understanding. By providing rich, multimodal supervision signals - including organ-level segmentation masks, grounded textual reports, and visual question-answering pairs - the dataset aims to enable AI models to learn how to associate visual evidence with relevant textual descriptions and explanations.

The researchers believe that this dataset can significantly advance the state of the art in multimodal medical AI, moving beyond specialized models towards more general-purpose systems that can handle a wider range of medical imaging tasks. While the dataset represents an important step forward, further research is needed to address potential limitations around dataset representativeness, annotation quality, clinical relevance, and ethical considerations. Nonetheless, the release of RadGenome-Chest CT is an exciting development that is likely to spur further innovation in the field of AI for Medicine.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

👨‍🏫

RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis

Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Jiayu Lei, Ya Zhang, Yanfeng Wang, Weidi Xie

Developing generalist foundation model has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which emphasizes the requirements on developing open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models, to extend the original datasets (over 25,692 non-contrast 3D chest CT volume and reports from 20,000 patients) from the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665 K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of CT volume in the form of a segmentation mask; (iii) 1.3 M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models, by training to generate texts based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.

4/26/2024

Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

Weiwei Cao, Jianpeng Zhang, Yingda Xia, Tony C. W. Mok, Zi Li, Xianghua Ye, Le Lu, Jian Zheng, Yuxing Tang, Ling Zhang

Radiologists highly desire fully automated versatile AI for medical imaging interpretation. However, the lack of extensively annotated large-scale multi-disease datasets has hindered the achievement of this goal. In this paper, we explore the feasibility of leveraging language as a naturally high-quality supervision for chest CT imaging. In light of the limited availability of image-report pairs, we bootstrap the understanding of 3D chest CT images by distilling chest-related diagnostic knowledge from an extensively pre-trained 2D X-ray expert model. Specifically, we propose a language-guided retrieval method to match each 3D CT image with its semantically closest 2D X-ray image, and perform pair-wise and semantic relation knowledge distillation. Subsequently, we use contrastive learning to align images and reports within the same patient while distinguishing them from the other patients. However, the challenge arises when patients have similar semantic diagnoses, such as healthy patients, potentially confusing if treated as negatives. We introduce a robust contrastive learning that identifies and corrects these false negatives. We train our model with over 12,000 pairs of chest CT images and radiology reports. Extensive experiments across multiple scenarios, including zero-shot learning, report generation, and fine-tuning processes, demonstrate the model's feasibility in interpreting chest CT images.

4/9/2024

🔗

Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray

Qiao Deng, Zhongzhen Huang, Yunqi Wang, Zhichuan Wang, Zhao Wang, Xiaofan Zhang, Qi Dou, Yeung Yu Hui, Edward S. Hui

Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical image and text. Current algorithms that exploit the global and local alignment between medical image and text could however be marred by the redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-ray. In this framework, medical knowledge is grounded to the appropriate anatomical regions by using a transformer-based grounded knowledge-enhanced module for fine-grained alignment between anatomical region-level visual features and the textural features of medical knowledge. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream chest X-ray disease classification, disease localization, report generation, and medical visual question-answering tasks. Our results show the advantage of incorporating grounding mechanism to remove biases and improve the alignment between chest X-ray image and radiology report.

4/24/2024

CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting

Naman Sharma

Recently large vision-language models have shown potential when interpreting complex images and generating natural language descriptions using advanced reasoning. Medicine's inherently multimodal nature incorporating scans and text-based medical histories to write reports makes it conducive to benefit from these leaps in AI capabilities. We evaluate the publicly available, state of the art, foundational vision-language models for chest X-ray interpretation across several datasets and benchmarks. We use linear probes to evaluate the performance of various components including CheXagent's vision transformer and Q-former, which outperform the industry-standard Torch X-ray Vision models across many different datasets showing robust generalisation capabilities. Importantly, we find that vision-language models often hallucinate with confident language, which slows down clinical interpretation. Based on these findings, we develop an agent-based vision-language approach for report generation using CheXagent's linear probes and BioViL-T's phrase grounding tools to generate uncertainty-aware radiology reports with pathologies localised and described based on their likelihood. We thoroughly evaluate our vision-language agents using NLP metrics, chest X-ray benchmarks and clinical evaluations by developing an evaluation platform to perform a user study with respiratory specialists. Our results show considerable improvements in accuracy, interpretability and safety of the AI-generated reports. We stress the importance of analysing results for normal and abnormal scans separately. Finally, we emphasise the need for larger paired (scan and report) datasets alongside data augmentation to tackle overfitting seen in these large vision-language models.

7/15/2024