PLUTO: Pathology-Universal Transformer

Read original: arXiv:2405.07905 - Published 5/14/2024 by Dinkar Juyal, Harshith Padigela, Chintan Shah, Daniel Shenker, Natalia Harguindeguy, Yi Liu, Blake Martin, Yibo Zhang, Michael Nercessian, Miles Markey and 23 others

🔄

Overview

The paper proposes a lightweight "Pathology Universal Transformer" (PLUTO) - a pre-trained model that can be used for a variety of pathology image analysis tasks.
Pathology images present unique challenges for computer vision due to their massive size and complexity, containing millions of objects of interest across multiple resolutions.
PLUTO is pre-trained on a diverse dataset of 195 million pathology image tiles and can extract meaningful representations across multiple scales to enable a range of downstream tasks.

Plain English Explanation

Pathology is the study of diseases by examining tissue samples under a microscope. Pathology images provide a special challenge for AI systems because they are extremely large, often containing millions of small objects that need to be analyzed. This paper introduces PLUTO, a pre-trained AI model that can be used for a variety of pathology image analysis tasks.

PLUTO was trained on a huge dataset of 195 million tiny image patches from pathology slides, allowing it to learn general features and representations that are useful across many different pathology-related tasks. This includes things like identifying specific cell types, classifying tissue regions, and making diagnoses at the whole slide level.

The key advantage of PLUTO is that it provides a "universal" set of features that can be easily adapted to new pathology tasks, rather than requiring a separate model to be trained from scratch for each new application. This makes PLUTO more efficient and practical to deploy in real-world clinical settings.

Technical Explanation

The paper introduces the PathoLogy Universal TransfOrmer (PLUTO), a lightweight pre-trained model for pathology image analysis. Pathology Whole Slide Images (WSIs) present a unique challenge for computer vision, as they are gigapixel-sized and can contain hundreds of thousands to millions of objects of interest across multiple resolutions.

To address this, the authors pre-train PLUTO on a diverse dataset of 195 million image tiles collected from multiple sites. This allows PLUTO to extract meaningful representations across multiple WSI scales, enabling it to perform a variety of downstream pathology tasks. The authors design task-specific adaptation heads that leverage PLUTO's output embeddings for applications ranging from subcellular instance segmentation to slide-level prediction.

PLUTO's performance is compared to other state-of-the-art methods on a diverse set of external and internal benchmarks covering different tissue types, resolutions, stains, and scanners. The results show that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, even when those models use orders-of-magnitude larger datasets and model sizes.

The findings demonstrate the potential of a universal pathology embedding to power a wide range of image analysis tasks. This motivates further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability.

Critical Analysis

The paper presents a compelling approach to address the unique challenges of pathology image analysis using a pre-trained "foundation model" like PLUTO. However, there are a few potential limitations and areas for further research:

Data Diversity: While the 195 million image tile dataset used for pre-training is large, it may not capture the full diversity of pathology images seen in real-world clinical practice. Expanding the pre-training dataset to include images from a broader range of sources, tissue types, and disease states could further improve PLUTO's generalization.
Architectural Improvements: The paper describes PLUTO as a "lightweight" model, but does not provide details on its exact architecture or size. Exploring more sophisticated model designs or hybrid approaches that combine PLUTO with other specialized modules could potentially boost performance further.
Sample Efficiency: While PLUTO outperforms larger task-specific models, it's not clear how much fine-tuning data is required to achieve good performance on new tasks. Improving the sample efficiency of the adaptation process could make PLUTO more practical for real-world clinical deployment.

Overall, the PLUTO model represents an important step towards more practical and scalable pathology image analysis, but continued research is needed to fully realize the potential of pathology foundation models.

Conclusion

The PLUTO model proposed in this paper demonstrates the promise of a "universal" pre-trained representation for pathology image analysis. By pre-training on a large, diverse dataset, PLUTO is able to extract meaningful features across multiple scales of pathology WSIs, enabling it to perform well on a variety of downstream tasks.

The authors' findings suggest that pathology foundation models like PLUTO could serve as a powerful, efficient foundation for a wide range of clinical image analysis applications. This motivates further research into improving the data diversity, architecture, and sample efficiency of these models to make them more practical for real-world deployment.

As the field of computational pathology continues to advance, approaches like PLUTO could play a crucial role in unlocking the full potential of AI-powered tools to assist pathologists and improve patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🔄

PLUTO: Pathology-Universal Transformer

Dinkar Juyal, Harshith Padigela, Chintan Shah, Daniel Shenker, Natalia Harguindeguy, Yi Liu, Blake Martin, Yibo Zhang, Michael Nercessian, Miles Markey, Isaac Finberg, Kelsey Luu, Daniel Borders, Syed Ashar Javed, Emma Krause, Raymond Biju, Aashish Sood, Allen Ma, Jackson Nyman, John Shamshoian, Guillaume Chhor, Darpan Sanghavi, Marc Thibault, Limin Yu, Fedaa Najdawi, Jennifer A. Hipp, Darren Fahy, Benjamin Glass, Eric Walk, John Abel, Harsha Pokkalla, Andrew H. Beck, Sean Grullon

Pathology is the study of microscopic inspection of tissue, and a pathology diagnosis is often the medical gold standard to diagnose disease. Pathology images provide a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology FM that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales that enable a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks which span pathology scales ranging from subcellular to slide-scale, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes when compared to PLUTO. Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.

5/14/2024

Hibou: A Family of Foundational Vision Transformers for Pathology

Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova

Pathology, the microscopic examination of diseased tissue, is critical for diagnosing various medical conditions, particularly cancers. Traditional methods are labor-intensive and prone to human error. Digital pathology, which converts glass slides into high-resolution digital images for analysis by computer algorithms, revolutionizes the field by enhancing diagnostic accuracy, consistency, and efficiency through automated image analysis and large-scale data processing. Foundational transformer pretraining is crucial for developing robust, generalizable models as it enables learning from vast amounts of unannotated data. This paper introduces the Hibou family of foundational vision transformers for pathology, leveraging the DINOv2 framework to pretrain two model variants, Hibou-B and Hibou-L, on a proprietary dataset of over 1 million whole slide images (WSIs) representing diverse tissue types and staining techniques. Our pretrained models demonstrate superior performance on both patch-level and slide-level benchmarks, surpassing existing state-of-the-art methods. Notably, Hibou-L achieves the highest average accuracy across multiple benchmark datasets. To support further research and application in the field, we have open-sourced the Hibou models, which can be accessed at https://github.com/HistAI/hibou.

8/21/2024

PathAlign: A vision-language model for whole slide images in histopathology

Faruk Ahmed, Andrew Sellergren, Lin Yang, Shawn Xu, Boris Babenko, Abbi Ward, Niels Olson, Arash Mohtashamian, Yossi Matias, Greg S. Corrado, Quang Duong, Dale R. Webster, Shravya Shetty, Daniel Golden, Yun Liu, David F. Steiner, Ellery Wulczyn

Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

7/1/2024

A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Shu Yang, Huangjing Lin, Xin Wang, Jiguang Wang, Li Liang, Anjia Han, Ronald Cheong Kin Chan, Hao Chen

Remarkable strides in computational pathology have been made in the task-agnostic foundation model that advances the performance of a wide array of downstream clinical tasks. Despite the promising performance, there are still several challenges. First, prior works have resorted to either vision-only or vision-captions data, disregarding invaluable pathology reports and gene expression profiles which respectively offer distinct knowledge for versatile clinical applications. Second, the current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset consisting of H&E diagnostic whole slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm which injects multimodal knowledge at the whole-slide context into the pathology FM, called Multimodal Self-TAught PRetraining (mSTAR). The proposed paradigm revolutionizes the workflow of pretraining for CPath, which enables the pathology FM to acquire the whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from patch-level to slide-level. To systematically evaluate the capabilities of mSTAR, extensive experiments including slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks on 43 subtasks, resulting in the largest spectrum of downstream tasks. The average performance in various slide-level applications consistently demonstrates significant performance enhancements for mSTAR compared to SOTA FMs.

7/23/2024