Benchmarking Pathology Feature Extractors for Whole Slide Image Classification

Read original: arXiv:2311.11772 - Published 6/24/2024 by Georg Wolflein (University of St Andrews, St Andrews, United Kingdom, Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany), Dyke Ferber (Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus and 45 others

Overview

This paper explores a critical task in computational pathology: weakly supervised whole slide image classification.
The authors conduct a comprehensive benchmarking study to address three key questions:
1. Is stain normalization still necessary for preprocessing?
2. Which feature extractors perform best for downstream slide-level classification?
3. How does magnification affect downstream performance?
The study involves over 10,000 training runs across various setups, challenging existing assumptions in the field.

Plain English Explanation

In the field of computational pathology, one important task is weakly supervised whole slide image classification. This involves taking a digital image of an entire tissue sample (a "whole slide") and using machine learning to predict some property of that sample, such as the presence of a certain disease.

To build models that can do this, researchers have to make many decisions about how to process the images and which algorithms to use. These choices are often made without robust empirical or conclusive theoretical justification.

This paper aims to address that by conducting a very comprehensive study, looking at over 10,000 different experiments. The key questions they investigate are:

1. Do we still need "stain normalization" (a common preprocessing step that tries to make images more consistent), or can we skip it?
2. Which feature extraction methods work best for feeding into the final classification model?
3. Does the magnification level of the images matter for getting accurate slide-level predictions?

The authors' findings challenge some existing assumptions in the field. For example, they find that skipping stain normalization and data augmentation doesn't hurt performance, while the choice of feature extractor is the most important factor for downstream classification. They also discover that lower magnification images can be just as useful as higher magnification ones.

Overall, this work has the potential to streamline digital pathology workflows by reducing the need for certain preprocessing steps and providing guidance on the best feature extractors to use.

Technical Explanation

The core task explored in this paper is weakly supervised whole slide image classification. This means taking a set of image "patches" that together make up a whole slide, and using those to predict some slide-level label or property, without having detailed annotations for each individual patch.
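To make the "patches in, one slide-level label out" idea concrete, here is a minimal sketch of attention-based pooling, one common way to aggregate patch features in multiple instance learning. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight vectors, and the toy feature values are all hypothetical, and the weights are random rather than trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_pool(patch_features, attn_weights):
    """Collapse a bag of patch feature vectors into one slide-level vector.

    patch_features: list of equal-length feature vectors, one per patch.
    attn_weights:   hypothetical learned vector used to score each patch.
    """
    scores = [sum(w * x for w, x in zip(attn_weights, feat)) for feat in patch_features]
    attention = softmax(scores)  # one weight per patch, summing to 1
    dim = len(patch_features[0])
    slide_vector = [
        sum(a * feat[d] for a, feat in zip(attention, patch_features))
        for d in range(dim)
    ]
    return slide_vector, attention


def predict_slide(patch_features, attn_weights, clf_weights, clf_bias=0.0):
    """Return a slide-level probability using no patch-level labels."""
    slide_vector, attention = attention_pool(patch_features, attn_weights)
    logit = sum(w * x for w, x in zip(clf_weights, slide_vector)) + clf_bias
    return sigmoid(logit), attention


if __name__ == "__main__":
    rng = random.Random(0)
    # A toy "slide": 6 patches, each already encoded as a 4-dim feature vector.
    bag = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
    attn_w = [rng.gauss(0, 1) for _ in range(4)]  # untrained, illustration only
    clf_w = [rng.gauss(0, 1) for _ in range(4)]   # untrained, illustration only
    prob, attention = predict_slide(bag, attn_w, clf_w)
    print(f"slide-level probability: {prob:.3f}")
    print("attention per patch:", [round(a, 3) for a in attention])
```

In a real pipeline the patch features would come from a pretrained feature extractor and only the slide-level label would supervise training, which is what makes the setting weakly supervised.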

To investigate this task, the authors conduct an extensive benchmarking study. They evaluate over 10,000 training runs across a range of different setups, including:

14 different feature extractors (models for converting raw image data into more abstract representations)
9 different classification tasks
5 public pathology datasets
3 downstream classification architectures
2 levels of image magnification
various preprocessing approaches (with and without stain normalization and data augmentation)
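The size of such a benchmark grows quickly because the axes multiply. The sketch below enumerates a hypothetical grid using the counts listed above; the labels are placeholders, and the dataset axis is left out on the assumption that each task is tied to a dataset rather than crossed with all five. Preprocessing variants and repeated runs, whose exact counts this summary does not give, would multiply the total further.

```python
from itertools import product

# Axis sizes come from the list above; the labels are placeholders.
feature_extractors = [f"extractor_{i}" for i in range(14)]
tasks = [f"task_{i}" for i in range(9)]
downstream_models = [f"aggregator_{i}" for i in range(3)]
magnifications = ["low", "high"]


def build_grid():
    """Enumerate every (extractor, task, aggregator, magnification) combination."""
    return [
        {"extractor": e, "task": t, "aggregator": a, "magnification": m}
        for e, t, a, m in product(
            feature_extractors, tasks, downstream_models, magnifications
        )
    ]


grid = build_grid()
print(len(grid))  # 14 * 9 * 3 * 2 = 756 base configurations
```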

Their goal is to answer three critical questions:

1. Is stain normalization still a necessary preprocessing step? The authors find that skipping stain normalization and data augmentation does not degrade performance, while significantly reducing memory and computational demands.
2. Which feature extractors perform best for downstream slide-level classification? The authors develop a novel evaluation metric to compare relative downstream performance, and show that the choice of feature extractor is the most consequential factor, more so than other architectural choices.
3. How does magnification affect downstream performance? Contrary to expectations, the authors find that lower-magnification slides are sufficient for accurate slide-level classification.
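This summary does not define the authors' evaluation metric, so the following is only a sketch of one simple way to compare extractors on a relative basis: for each task, measure how far each extractor's AUROC falls below the best extractor on that task, then average those gaps across tasks. The function name and all numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean


def relative_scores(auroc_by_task):
    """Average gap to the best extractor per task (0.0 means best on every task).

    auroc_by_task: {task_name: {extractor_name: AUROC}}
    """
    gaps = {}
    for scores in auroc_by_task.values():
        best = max(scores.values())
        for extractor, score in scores.items():
            gaps.setdefault(extractor, []).append(best - score)
    return {extractor: mean(values) for extractor, values in gaps.items()}


# Made-up AUROC values, for illustration only.
toy = {
    "task_a": {"extractor_x": 0.80, "extractor_y": 0.75},
    "task_b": {"extractor_x": 0.60, "extractor_y": 0.70},
}

scores = relative_scores(toy)
# Smaller average gap ranks first.
ranking = sorted(scores.items(), key=lambda item: item[1])
for name, gap in ranking:
    print(f"{name}: mean gap to best = {gap:.3f}")
```

Comparing gaps rather than raw AUROC keeps an easy task from dominating the average, which is the general motivation for a relative measure.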

Unlike previous patch-level benchmarking studies, this work emphasizes clinical relevance by focusing on slide-level biomarker prediction tasks in a weakly supervised setting with external validation cohorts.

Critical Analysis

The authors acknowledge several caveats and limitations to their study. For example, they note that their findings may be specific to the particular datasets and tasks they examined, and that further research is needed to determine how generalizable the results are.

Additionally, while the authors provide a comprehensive evaluation of feature extractors, they do not explore the use of more recently developed techniques like self-supervised learning or contrastive learning, which other recent work has investigated. It would be interesting to see how these newer approaches compare to the feature extractors tested here.

Another potential issue is that the authors' evaluation focuses solely on slide-level classification performance, without considering other factors such as interpretability or computational efficiency, both of which matter for real-world deployment. A more holistic assessment of the feature extractors could provide additional insights.

Finally, while the authors highlight the clinical relevance of their work, it is unclear how well their findings would translate to actual clinical practice. More research may be needed to understand the real-world implications and implementation challenges, such as those raised in related work on stain normalization and on quality control for whole slide images.

Overall, this is a robust and valuable study that challenges existing assumptions in computational pathology. However, as with any research, there are opportunities for further exploration and refinement.

Conclusion

This comprehensive benchmarking study provides important insights for the field of computational pathology. The authors' findings suggest that some long-held assumptions about necessary preprocessing steps and the importance of image magnification may need to be reevaluated.

By demonstrating that stain normalization and data augmentation are not always required, and that lower-magnification images can be just as useful as higher-magnification ones, the authors have the potential to streamline digital pathology workflows and reduce computational demands. Furthermore, their identification of the critical role of feature extractors can help guide the selection of appropriate models for downstream classification tasks.

While this research has limitations and requires further validation, it represents a significant step forward in understanding the key factors that drive performance in weakly supervised whole slide image classification. As the field of computational pathology continues to evolve, studies like this will be instrumental in optimizing workflows, improving clinical decision-making, and ultimately benefiting patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Benchmarking Pathology Feature Extractors for Whole Slide Image Classification

Georg Wolflein (University of St Andrews, St Andrews, United Kingdom, Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany), Dyke Ferber (Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany, Department of Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany), Asier R. Meneghetti (Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany), Omar S. M. El Nahhas (Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany), Daniel Truhn (University Hospital Aachen, Germany), Zunamys I. Carrero (Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany), David J. Harrison (University of St Andrews, St Andrews, United Kingdom), Ognjen Arandjelović (University of St Andrews, St Andrews, United Kingdom), Jakob Nikolas Kather (Else Kroner Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany, Department of Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany, Department of Medicine I, University Hospital Dresden, Dresden, Germany)

Weakly supervised whole slide image classification is a key task in computational pathology, which involves predicting a slide-level label from a set of image patches constituting the slide. Constructing models to solve this task involves multiple design choices, often made without robust empirical or conclusive theoretical justification. To address this, we conduct a comprehensive benchmarking of feature extractors to answer three critical questions: 1) Is stain normalisation still a necessary preprocessing step? 2) Which feature extractors are best for downstream slide-level classification? 3) How does magnification affect downstream performance? Our study constitutes the most comprehensive evaluation of publicly available pathology feature extractors to date, involving more than 10,000 training runs across 14 feature extractors, 9 tasks, 5 datasets, 3 downstream architectures, 2 levels of magnification, and various preprocessing setups. Our findings challenge existing assumptions: 1) We observe empirically, and by analysing the latent space, that skipping stain normalisation and image augmentations does not degrade performance, while significantly reducing memory and computational demands. 2) We develop a novel evaluation metric to compare relative downstream performance, and show that the choice of feature extractor is the most consequential factor for downstream performance. 3) We find that lower-magnification slides are sufficient for accurate slide-level classification. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level biomarker prediction tasks in a weakly supervised setting with external validation cohorts. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.

6/24/2024

Whole Slide Image Survival Analysis Using Histopathological Feature Extractors

Kleanthis Marios Papadopoulos

The abundance of information present in Whole Slide Images (WSIs) makes them useful for prognostic evaluation. A large number of models utilizing a pretrained ResNet backbone have been released and employ various feature aggregation techniques, primarily based on Multiple Instance Learning (MIL). By leveraging the recently released UNI feature extractor, existing models can be adapted to achieve higher accuracy, which paves the way for more robust prognostic tools in digital pathology.

5/29/2024

Enhancing Whole Slide Pathology Foundation Models through Stain Normalization

Juseung Yun, Yi Hu, Jinhyung Kim, Jongseong Jang, Soonyoung Lee

Recent advancements in digital pathology have led to the development of numerous foundational models that utilize self-supervised learning on patches extracted from gigapixel whole slide images (WSIs). While this approach leverages vast amounts of unlabeled data, we have discovered a significant issue: features extracted from these self-supervised models tend to cluster by individual WSIs, a phenomenon we term WSI-specific feature collapse. This problem can potentially limit the model's generalization ability and performance on various downstream tasks. To address this issue, we introduce EXAONEPath, a novel foundational model trained on patches that have undergone stain normalization. Stain normalization helps reduce color variability arising from different laboratories and scanners, enabling the model to learn more consistent features. EXAONEPath is trained using 285,153,903 patches extracted from a total of 34,795 WSIs. Our experiments demonstrate that EXAONEPath significantly mitigates the feature collapse problem, indicating that the model has learned more generalized features rather than overfitting to individual WSI characteristics. We compared EXAONEPath with state-of-the-art models across six downstream task datasets, and our results show that EXAONEPath achieves superior performance relative to the number of WSIs used and the model's parameter count. This suggests that the application of stain normalization has substantially improved the model's efficiency and generalization capabilities.

8/23/2024

The Importance of Downstream Networks in Digital Pathology Foundation Models

Gustav Bredell, Marcel Fischer, Przemyslaw Szostak, Samaneh Abbasi-Sureshjani, Alvaro Gomariz

Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSI). In this process, WSIs are first divided into patches, for which a feature extractor model is applied to obtain feature vectors, which are subsequently processed by an aggregation model to predict the respective WSI label. With the rapid evolution of representation learning, numerous new feature extractor models, often termed foundational models, have emerged. Traditional evaluation methods rely on a static downstream aggregation model setup, encompassing a fixed architecture and hyperparameters, a practice we identify as potentially biasing the results. Our study uncovers a sensitivity of feature extractor models towards aggregation model configurations, indicating that performance comparability can be skewed based on the chosen configurations. By accounting for this sensitivity, we find that the performance of many current feature extractor models is notably similar. We support this insight by evaluating seven feature extractor models across three different datasets with 162 different aggregation model configurations. This comprehensive approach provides a more nuanced understanding of the feature extractors' sensitivity to various aggregation model configurations, leading to a fairer and more accurate assessment of new foundation models in digital pathology.

8/6/2024