PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Read original: arXiv:2408.07037 - Published 8/14/2024 by Xiaomin Wu, Rui Xu, Pengchen Wei, Wenkang Qin, Peixiang Huang, Ziheng Li, Lin Luo

PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Overview

This paper presents PathInsight, a framework for instruction tuning of multimodal datasets and models for intelligence-assisted diagnosis in histopathology.
It explores using large language models (LLMs) for computational pathology tasks, with a focus on multimodal data and instruction tuning.
The key contributions include developing datasets and models for histopathology diagnosis, and demonstrating how instruction tuning can improve the performance of LLMs on these tasks.

Plain English Explanation

The paper describes a new framework called PathInsight that uses large language models (LLMs) to assist with medical diagnosis based on histopathology images and related text data. Histopathology is the study of diseased tissue samples under a microscope, and it's a crucial part of medical diagnosis.

The researchers wanted to see if they could train LLMs, which are powerful AI models that excel at understanding and generating human language, to help with histopathology analysis. They developed specialized datasets that combine histopathology images with relevant text information. Then, they used a technique called "instruction tuning" to fine-tune the LLMs so they could better understand and work with this multimodal (image + text) data.

The key idea is that by training the LLMs to follow specific instructions related to histopathology tasks, the models can become more effective at assisting doctors and pathologists with disease diagnosis. For example, the LLMs could be instructed to analyze a histopathology slide, identify key features, and provide a preliminary diagnosis, all in a way that is easy for human experts to understand and verify.

Technical Explanation

The paper introduces PathInsight, a framework for instruction tuning of multimodal datasets and models for intelligence-assisted diagnosis in histopathology. The researchers developed specialized datasets that combine histopathology images with relevant text data, covering topics like disease types, tissue characteristics, and diagnostic criteria.

They then used instruction tuning to fine-tune large language models (LLMs) like GPT-3 on these multimodal histopathology datasets. Instruction tuning involves training the LLMs to follow specific prompts or instructions related to the target tasks, in this case histopathology analysis and diagnosis.

The key innovation is leveraging the powerful language understanding and generation capabilities of LLMs and adapting them to the multimodal world of histopathology. By training the models to understand and generate text that is relevant to the visual information in the histopathology slides, the researchers were able to create LLMs that can assist pathologists and doctors in their diagnostic work.

The paper demonstrates the effectiveness of this approach through a series of experiments, showing how instruction-tuned LLMs can outperform baseline models on a range of histopathology tasks, including identifying disease patterns, providing differential diagnoses, and generating relevant text descriptions.

Critical Analysis

The paper presents a promising approach to leveraging large language models for computational pathology, but it also acknowledges several limitations and areas for future research.

One key limitation is the reliance on curated, high-quality multimodal datasets, which can be challenging and expensive to obtain in the medical domain. The paper mentions that the datasets used in the study were relatively small, and the researchers note the need for larger and more diverse datasets to fully unleash the potential of instruction-tuned LLMs.

Additionally, while the experiments demonstrate improved performance on specific histopathology tasks, the paper does not address the broader challenges of deploying such systems in real-world clinical settings. Factors like interpretability, reliability, and integration with existing medical workflows would need to be carefully considered.

Further research is also needed to explore the generalizability of the instruction tuning approach beyond histopathology, and to investigate the potential ethical and privacy implications of using LLMs in sensitive medical domains.

Conclusion

The PathInsight framework presented in this paper represents an important step towards leveraging the power of large language models for intelligence-assisted diagnosis in histopathology. By developing specialized multimodal datasets and using instruction tuning, the researchers have shown how LLMs can be adapted to excel at a range of pathology-related tasks.

While further research is needed to address the limitations and expand the capabilities of this approach, the paper highlights the potential of this technology to enhance the work of pathologists and clinicians, ultimately leading to improved patient outcomes. As AI continues to advance, frameworks like PathInsight may play a crucial role in the future of computational pathology and medical diagnosis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Xiaomin Wu, Rui Xu, Pengchen Wei, Wenkang Qin, Peixiang Huang, Ziheng Li, Lin Luo

Pathological diagnosis remains the definitive standard for identifying tumors. The rise of multimodal large models has simplified the process of integrating image analysis with textual descriptions. Despite this advancement, the substantial costs associated with training and deploying these complex multimodal models, together with a scarcity of high-quality training datasets, create a significant divide between cutting-edge technology and its application in the clinical setting. We had meticulously compiled a dataset of approximately 45,000 cases, covering over 6 different tasks, including the classification of organ tissues, generating pathology report descriptions, and addressing pathology-related questions and answers. We have fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, InternLM, with this dataset to enhance instruction-based performance. We conducted a qualitative assessment of the capabilities of the base model and the fine-tuned model in performing image captioning and classification tasks on the specific dataset. The evaluation results demonstrate that the fine-tuned model exhibits proficiency in addressing typical pathological questions. We hope that by making both our models and datasets publicly available, they can be valuable to the medical and research communities.

8/14/2024

A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Shu Yang, Huangjing Lin, Xin Wang, Jiguang Wang, Li Liang, Anjia Han, Ronald Cheong Kin Chan, Hao Chen

Remarkable strides in computational pathology have been made in the task-agnostic foundation model that advances the performance of a wide array of downstream clinical tasks. Despite the promising performance, there are still several challenges. First, prior works have resorted to either vision-only or vision-captions data, disregarding invaluable pathology reports and gene expression profiles which respectively offer distinct knowledge for versatile clinical applications. Second, the current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset consisting of H&E diagnostic whole slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm which injects multimodal knowledge at the whole-slide context into the pathology FM, called Multimodal Self-TAught PRetraining (mSTAR). The proposed paradigm revolutionizes the workflow of pretraining for CPath, which enables the pathology FM to acquire the whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch-level to slide-level. To systematically evaluate the capabilities of mSTAR, extensive experiments including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. The average performance in various slide-level applications consistently demonstrates significant performance enhancements for mSTAR compared to SOTA FMs.

7/23/2024

Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

Mehmet Saygin Seyfioglu, Wisdom O. Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, Linda Shapiro

Diagnosis in histopathology requires a global whole slide images (WSIs) analysis, requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-model models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a spatial grounding of the concepts within each patch and without a wider view of the WSI. Therefore, they lack sufficient diagnostic capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provides spatial localization of narrations by automatically extracting the narrators' cursor positions. Quilt-Instruct supports contextual reasoning by extracting diagnosis and supporting facts from the entire WSI. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answers. We also thoroughly evaluate Quilt-LLaVA using public histopathology datasets, where Quilt-LLaVA significantly outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set VQA. Our code, data, and model are publicly accessible at quilt-llava.github.io.

4/11/2024

Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning

Chen Shen, Chunfeng Lian, Wanqing Zhang, Fan Wang, Jianhua Zhang, Shuanliang Fan, Xin Wei, Gongji Wang, Kehan Li, Hongshu Mu, Hao Wu, Xinggong Liang, Jianhua Ma, Zhenyuan Wang

Forensic pathology is critical in determining the cause and manner of death through post-mortem examinations, both macroscopic and microscopic. The field, however, grapples with issues such as outcome variability, laborious processes, and a scarcity of trained professionals. This paper presents SongCi, an innovative visual-language model (VLM) designed specifically for forensic pathology. SongCi utilizes advanced prototypical cross-modal self-supervised contrastive learning to enhance the accuracy, efficiency, and generalizability of forensic analyses. It was pre-trained and evaluated on a comprehensive multi-center dataset, which includes over 16 million high-resolution image patches, 2,228 vision-language pairs of post-mortem whole slide images (WSIs), and corresponding gross key findings, along with 471 distinct diagnostic outcomes. Our findings indicate that SongCi surpasses existing multi-modal AI models in many forensic pathology tasks, performs comparably to experienced forensic pathologists and significantly better than less experienced ones, and provides detailed multi-modal explainability, offering critical assistance in forensic investigations. To the best of our knowledge, SongCi is the first VLM specifically developed for forensic pathological analysis and the first large-vocabulary computational pathology (CPath) model that directly processes gigapixel WSIs in forensic science.

7/23/2024