WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

Read original: arXiv:2407.05603 - Published 7/9/2024 by Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, Lin Yang

Overview

This paper introduces WSI-VQA, a framework for interpreting whole slide images (WSIs) using generative visual question answering (VQA).
WSIs are high-resolution digital scans of tissue samples used in digital pathology, and VQA involves answering questions about an image.
The key innovation is reframing slide-level tasks such as immunohistochemical grading, survival prediction, and tumor subtyping as question answering, so that a single generative model can produce text answers that help pathologists interpret these complex images.

Plain English Explanation

The paper introduces a new way to help pathologists understand whole slide images (WSIs) - high-resolution digital scans of tissue samples used in medical diagnosis. These WSIs can be very complex and difficult for pathologists to interpret.

The researchers developed a framework called WSI-VQA that uses a type of artificial intelligence called "generative visual question answering" (VQA). A pathologist asks a question about a slide, and the model generates the answer as text. This differs from a conventional classifier, which handles one task and can only choose from a fixed set of labels.

By answering questions about a slide directly, the WSI-VQA system aims to make it easier for pathologists to interpret the complex information contained in whole slide images. This could help improve medical diagnosis and treatment decisions that rely on digital pathology.

Technical Explanation

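The core innovation of the paper is the WSI-VQA framework, which adapts generative VQA techniques to the domain of whole slide imaging. In VQA, the goal is to generate natural language answers in response to questions about visual content. According to the abstract, WSI-VQA reframes various slide-level tasks (immunohistochemical grading, survival prediction, and tumor subtyping) in a question-answering pattern, so one model can serve all of them.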
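The reframing of slide-level tasks as question answering, as described in the paper's abstract, can be illustrated with a small data-structure sketch. Everything here is an assumption made for illustration: the `SlideQA` record, the `QUESTION_TEMPLATES` wording, the slide IDs, and the labels are hypothetical and are not taken from the WSI-VQA dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlideQA:
    """One slide-level question-answer pair (hypothetical record layout)."""
    slide_id: str
    question: str
    answer: str


# Hypothetical question templates, one per slide-level task named in the abstract.
QUESTION_TEMPLATES = {
    "tumor_subtyping": "What is the tumor subtype of this slide?",
    "ihc_grading": "What is the immunohistochemical grade for {marker}?",
    "survival_prediction": "Is the expected survival above the cohort median?",
}


def make_qa(slide_id: str, task: str, label: str, **fields: str) -> SlideQA:
    """Turn a (task, label) annotation into a question-answer pair."""
    question = QUESTION_TEMPLATES[task].format(**fields)
    return SlideQA(slide_id=slide_id, question=question, answer=str(label))


# Illustrative annotations only; not real data.
pairs = [
    make_qa("slide_001", "tumor_subtyping", "invasive ductal carcinoma"),
    make_qa("slide_001", "ihc_grading", "positive", marker="HER2"),
]

for pair in pairs:
    print(pair.slide_id, "|", pair.question, "->", pair.answer)
```

With this layout, adding a new slide-level task only requires a new question template, not a new model head, which is the appeal of the question-answering framing.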

The generative model, named Wsi2Text Transformer (W2T), takes a WSI and a question as input and generates the answer as text. The authors also visualize the co-attention mapping between word embeddings and WSIs as an intuitive explanation for each diagnostic result.
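The paper's abstract mentions visualizing a co-attention mapping between word embeddings and WSIs. As a rough intuition for what such a map contains, here is a minimal sketch of generic scaled dot-product attention from word vectors to patch vectors. It is not the W2T architecture: there are no learned projections, the vectors are toy values, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(word_vecs, patch_vecs):
    """Attend from each word (query) to all patches (keys and values).

    Returns (weights, contexts): weights[i][j] is how strongly word i
    attends to patch j, and contexts[i] is the attention-weighted
    average of the patch vectors for word i.
    """
    dim = len(patch_vecs[0])
    weights, contexts = [], []
    for word in word_vecs:
        scores = [
            sum(w * p for w, p in zip(word, patch)) / math.sqrt(dim)
            for patch in patch_vecs
        ]
        attn = softmax(scores)
        context = [
            sum(a * patch[k] for a, patch in zip(attn, patch_vecs))
            for k in range(dim)
        ]
        weights.append(attn)
        contexts.append(context)
    return weights, contexts


# Toy 2-dimensional features for three patches and two words (assumed values).
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
words = [[2.0, 0.0], [0.0, 2.0]]
weights, contexts = cross_attention(words, patches)

for i, row in enumerate(weights):
    print(f"word {i} attention over patches:", [round(a, 3) for a in row])
```

Each row of `weights` sums to 1, so it can be rendered as a heatmap over patch locations to show which regions a given word draws on.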

The researchers established a new dataset, also called WSI-VQA, to train and evaluate the model. According to the abstract, it contains 8672 slide-level question-answering pairs across 977 WSIs, covering questions that require interpreting the slide as a whole.

The authors report that their generative model (W2T) outperforms existing discriminative models in medical correctness, in addition to handling different slide-level tasks within one model. They present this as evidence that the approach could be applied in clinical scenarios.
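Comparing a generative model with discriminative baselines requires scoring free-text answers against reference labels. This summary does not say how the paper measures correctness, so the sketch below shows only one plausible approach: exact-match accuracy after light text normalization. The function names and example answers are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def closed_set_accuracy(predictions, references):
    """Fraction of generated answers that match the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative generated answers and references; not real model output.
preds = ["Invasive ductal carcinoma.", "negative", "grade 2"]
refs = ["invasive ductal carcinoma", "positive", "Grade 2"]
score = closed_set_accuracy(preds, refs)
print(f"exact-match accuracy: {score:.3f}")
```

Exact match suits closed-ended questions with a small answer vocabulary. Open-ended answers would need a softer text-similarity measure instead.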

Critical Analysis

A key strength of the WSI-VQA framework is its ability to provide interpretable, natural language descriptions of whole slide images. This addresses an important challenge in digital pathology, where the sheer complexity of WSIs can make them difficult for pathologists to quickly understand and analyze.

However, the approach has some likely limitations. The answers generated by WSI-VQA, while informative, may not capture all the nuanced visual information that a pathologist would want to consider. There is also room for improvement in the model's ability to answer more specific, technical questions about the WSI content.

Additionally, the WSI-VQA dataset, while a valuable contribution, may not fully reflect the diversity of real-world pathology cases and questions. Expanding the dataset and further evaluating the framework in clinical workflows could help validate its practical utility.

Overall, the WSI-VQA work represents an exciting step towards making whole slide imaging more accessible and interpretable for pathologists. Continued research in this direction, along with close collaboration with domain experts, could lead to significant advancements in digital pathology and medical diagnosis.

Conclusion

This paper presents WSI-VQA, a novel framework for interpreting whole slide images using generative visual question answering. By generating text answers to questions about a slide, WSI-VQA aims to assist pathologists in quickly understanding and analyzing these complex digital pathology images.

The key innovation is the adaptation of generative VQA techniques to the WSI domain, enabled by a generative model (Wsi2Text Transformer, W2T) and a new WSI-VQA dataset of 8672 question-answering pairs across 977 WSIs. The authors report that the model outperforms existing discriminative models in medical correctness, suggesting its potential to enhance pathologists' diagnostic workflows.

While the current approach has some limitations, this work represents an important step towards making whole slide imaging more accessible and interpretable. Further research and collaboration with domain experts could lead to even more powerful AI-powered tools for digital pathology, with far-reaching implications for healthcare and medical discovery.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, Lin Yang

Whole slide imaging is routinely adopted for carcinoma diagnosis and prognosis. Abundant experience is required for pathologists to achieve accurate and reliable diagnostic results of whole slide images (WSI). The huge size and heterogeneous features of WSIs make the workflow of pathological reading extremely time-consuming. In this paper, we propose a novel framework (WSI-VQA) to interpret WSIs by generative visual question answering. WSI-VQA shows universality by reframing various kinds of slide-level tasks in a question-answering pattern, in which pathologists can achieve immunohistochemical grading, survival prediction, and tumor subtyping following human-machine interaction. Furthermore, we establish a WSI-VQA dataset which contains 8672 slide-level question-answering pairs with 977 WSIs. Besides the ability to deal with different slide-level tasks, our generative model which is named Wsi2Text Transformer (W2T) outperforms existing discriminative models in medical correctness, which reveals the potential of our model to be applied in the clinical scenario. Additionally, we also visualize the co-attention mapping between word embeddings and WSIs as an intuitive explanation for diagnostic results. The dataset and related code are available at https://github.com/cpystan/WSI-VQA.

7/9/2024

Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Hao Li, Ying Chen, Yifei Chen, Wenxian Yang, Bowen Ding, Yuchen Han, Liansheng Wang, Rongshan Yu

Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However, existing methods leverage coarse-grained pathogenetic descriptions for visual representation supervision, which are insufficient to capture the complex visual appearance of pathogenetic images, hindering the generalizability of models on diverse downstream tasks. Additionally, processing high-resolution WSIs can be computationally expensive. In this paper, we propose a novel Fine-grained Visual-Semantic Interaction (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics. Specifically, with meticulously designed queries, we start by utilizing a large language model to extract fine-grained pathological descriptions from various non-standardized raw reports. The output descriptions are then reconstructed into fine-grained labels used for training. By introducing a Task-specific Fine-grained Semantics (TFS) module, we enable prompts to capture crucial visual information in WSIs, which enhances representation learning and augments generalization capabilities significantly. Furthermore, given that pathological visual patterns are redundantly distributed across tissue slices, we sample a subset of visual instances during training. Our method demonstrates robust generalizability and strong transferability, dominantly outperforming the counterparts on the TCGA Lung Cancer dataset with at least 9.19% higher accuracy in few-shot experiments. The code is available at: https://github.com/ls1rius/WSI_FiVE.

4/8/2024

PathAlign: A vision-language model for whole slide images in histopathology

Faruk Ahmed, Andrew Sellergren, Lin Yang, Shawn Xu, Boris Babenko, Abbi Ward, Niels Olson, Arash Mohtashamian, Yossi Matias, Greg S. Corrado, Quang Duong, Dale R. Webster, Shravya Shetty, Daniel Golden, Yun Liu, David F. Steiner, Ellery Wulczyn

Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

7/1/2024

WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images

Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, Lin Yang

Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (PathText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues and achieve competitive performance on certain slide-level tasks. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping surpassing previous state-of-the-art approaches. Our collected dataset and related code are available.

6/28/2024