DiagSet: a dataset for prostate cancer histopathological image classification

Read original: arXiv:2105.04014 - Published 6/4/2024 by Michał Koziarski, Bogusław Cyganek, Przemysław Niedziela, Bogusław Olborski, Zbigniew Antosz, Marcin Żydak, Bogdan Kwolek, Paweł Wąsowicz, Andrzej Bukała, Jakub Swadźba and 1 other

Overview

Novel dataset of over 2.6 million tissue patches for prostate cancer detection
Proposed machine learning framework for cancer detection and diagnosis prediction
Achieves 94.6% accuracy in patch-level recognition; scan-level diagnoses show high statistical agreement with those of 9 human histopathologists

Plain English Explanation

Cancer is one of the biggest health challenges we face as a society. In this research paper, the authors introduce a new dataset of tissue samples that can be used to help detect prostate cancer. The dataset contains over 2.6 million individual tissue patches taken from 430 fully annotated scans, as well as 4,675 scans with binary diagnoses and 46 scans with diagnoses provided by a group of pathologists.

The authors also propose a machine learning approach that can be used to automatically detect cancerous tissue regions and predict whether a full scan indicates the presence of cancer. This approach uses ensembles of deep neural networks to analyze the tissue samples at different scales. It achieves an impressive 94.6% accuracy in identifying cancerous tissue patches, and the scan-level diagnosis performance was found to be highly consistent with that of 9 human pathologists.

By making this new dataset publicly available and demonstrating the potential of machine learning for prostate cancer detection, this research supports further work in computational pathology, including efforts to make AI systems for cancer diagnosis more interpretable. It could lead to more accurate and efficient screening, and potentially better patient outcomes.

Technical Explanation

The authors have created a novel histopathological dataset for prostate cancer detection, which consists of over 2.6 million tissue patches extracted from 430 fully annotated scans, 4,675 scans with assigned binary diagnoses, and 46 scans with diagnoses independently provided by a group of histopathologists. This dataset, available at https://github.com/michalkoziarski/DiagSet, represents a valuable resource for training and evaluating machine learning models for cancer detection.
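To make the idea of "tissue patches extracted from scans" concrete, here is a minimal sketch of tiling a scan region into patch coordinates. The function name `patch_origins`, the 256-pixel patch size, and the 128-pixel stride are illustrative assumptions, not values confirmed by this summary; the dataset repository should be consulted for the actual extraction procedure.

```python
def patch_origins(width, height, patch_size=256, stride=128):
    """Return the top-left (x, y) corner of every full patch in a region.

    patch_size and stride are assumed values chosen for illustration only.
    """
    if patch_size <= 0 or stride <= 0:
        raise ValueError("patch_size and stride must be positive")
    # Only positions where a whole patch fits inside the region are kept.
    xs = range(0, width - patch_size + 1, stride)
    ys = range(0, height - patch_size + 1, stride)
    return [(x, y) for y in ys for x in xs]


if __name__ == "__main__":
    origins = patch_origins(width=1024, height=512)
    print(len(origins), origins[:3])
```

With overlapping strides, even a modest number of gigapixel scans can plausibly yield millions of patches, which is consistent in spirit with the dataset size reported above.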

The authors also propose a machine learning framework that utilizes ensembles of deep neural networks operating on the histopathological scans at different scales. This approach achieves 94.6% accuracy in patch-level recognition of cancerous tissue. At the scan level, its diagnoses are compared with those of 9 human histopathologists and show high statistical agreement.
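The multi-scale ensemble idea can be sketched as simple probability averaging. This is a hedged illustration only: the summary does not say how the authors combine model outputs, so the averaging rule, the scale labels such as "40x", the 0.5 cutoff, and the function names are all assumptions.

```python
from statistics import fmean


def ensemble_probability(probs_by_scale):
    """Average cancer probabilities first within each scale, then across scales.

    probs_by_scale maps a hypothetical scale label (e.g. "40x") to the list of
    probabilities produced by the models operating at that scale.
    """
    scale_means = [fmean(probs) for probs in probs_by_scale.values() if probs]
    if not scale_means:
        raise ValueError("no model outputs supplied")
    return fmean(scale_means)


def classify_patch(probs_by_scale, cutoff=0.5):
    """Label a patch as cancerous when the ensemble probability reaches cutoff."""
    return ensemble_probability(probs_by_scale) >= cutoff


if __name__ == "__main__":
    outputs = {"40x": [0.9, 0.7], "10x": [0.4]}
    print(ensemble_probability(outputs), classify_patch(outputs))
```

Averaging within each scale before averaging across scales keeps a scale with many models from dominating the final score; other combination rules, such as voting, would be equally plausible.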

The framework includes a thresholding mechanism that allows the model to abstain from making a decision in cases where it is not confident, which can help improve the overall reliability of the system. This relates to the broader challenge of developing interpretable AI systems for cancer diagnosis.
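A thresholding rule with an abstention band might look like the sketch below. It assumes, for illustration, that a scan is scored by the fraction of its patches flagged as cancerous and that two thresholds split the score into "benign", "abstain", and "cancer"; the threshold values 0.05 and 0.20 and the function names are hypothetical and not taken from the paper.

```python
def cancer_fraction(patch_flags):
    """Fraction of patches in a scan that were classified as cancerous."""
    if not patch_flags:
        raise ValueError("scan has no patches")
    return sum(1 for flag in patch_flags if flag) / len(patch_flags)


def diagnose_scan(patch_flags, lower=0.05, upper=0.20):
    """Return 'benign', 'cancer', or 'abstain' for a scan.

    lower and upper are assumed thresholds; scores between them are treated
    as uncertain and left for a human pathologist to review.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    score = cancer_fraction(patch_flags)
    if score <= lower:
        return "benign"
    if score >= upper:
        return "cancer"
    return "abstain"


if __name__ == "__main__":
    print(diagnose_scan([True] * 10 + [False] * 90))
```

Widening the gap between the two thresholds trades coverage for reliability: more scans are deferred to a human, but the automatic decisions that remain are made with higher confidence.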

Critical Analysis

The authors acknowledge several limitations and areas for future research in the paper. For example, they note that the dataset is limited to prostate cancer and may not generalize well to other cancer types. Additionally, the 46 scans with independent pathologist diagnoses represent a relatively small sample size, and further validation on a larger, more diverse set of cases would be beneficial.

One concern not addressed in the paper is possible bias in the dataset, as it is unclear how the tissue samples were selected and annotated. It would be important to ensure that the dataset is representative of the broader population and does not overrepresent certain demographic groups or cancer subtypes.

Overall, the research presented in this paper is a valuable contribution to the field of computational pathology and the use of AI for cancer diagnosis. The dataset and the proposed machine learning framework have the potential to significantly improve the accuracy and efficiency of prostate cancer detection, which could lead to earlier diagnosis and better patient outcomes. However, as with any AI-based system, it will be crucial to carefully evaluate its performance, interpretability, and potential biases before deploying it in a clinical setting.

Conclusion

This research paper introduces a novel histopathological dataset for prostate cancer detection and a machine learning framework for automated cancer detection and diagnosis. The dataset, consisting of over 2.6 million tissue patches, represents a valuable resource for the field, while the proposed framework achieves impressive accuracy in identifying cancerous tissue and predicting scan-level diagnoses.

The authors' work has important implications for advancing the use of AI in computational pathology and improving the interpretability and reliability of such systems. By making the dataset publicly available and demonstrating the potential of their approach, the researchers have taken a significant step towards developing more accurate and efficient tools for cancer screening and diagnosis, which could ultimately lead to better patient outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DiagSet: a dataset for prostate cancer histopathological image classification

Michał Koziarski, Bogusław Cyganek, Przemysław Niedziela, Bogusław Olborski, Zbigniew Antosz, Marcin Żydak, Bogdan Kwolek, Paweł Wąsowicz, Andrzej Bukała, Jakub Swadźba, Piotr Sitkowski

Cancer diseases constitute one of the most significant societal challenges. In this paper, we introduce a novel histopathological dataset for prostate cancer detection. The proposed dataset, consisting of over 2.6 million tissue patches extracted from 430 fully annotated scans, 4675 scans with assigned binary diagnoses, and 46 scans with diagnoses independently provided by a group of histopathologists can be found at https://github.com/michalkoziarski/DiagSet. Furthermore, we propose a machine learning framework for detection of cancerous tissue regions and prediction of scan-level diagnosis, utilizing thresholding to abstain from the decision in uncertain cases. The proposed approach, composed of ensembles of deep neural networks operating on the histopathological scans at different scales, achieves 94.6% accuracy in patch-level recognition and is compared in a scan-level diagnosis with 9 human histopathologists showing high statistical agreement.

6/4/2024

An interpretable machine learning system for colorectal cancer diagnosis from pathology slides

Pedro C. Neto, Diana Montezuma, Sara P. Oliveira, Domingos Oliveira, João Fraga, Ana Monteiro, João Monteiro, Liliana Ribeiro, Sofia Gonçalves, Stefan Reinhard, Inti Zlobec, Isabel M. Pinto, Jaime S. Cardoso

Considering the profound transformation affecting pathology practice, we aimed to develop a scalable artificial intelligence (AI) system to diagnose colorectal cancer from whole-slide images (WSI). For this, we propose a deep learning (DL) system that learns from weak labels, a sampling strategy that reduces the number of training samples by a factor of six without compromising performance, an approach to leverage a small subset of fully annotated samples, and a prototype with explainable predictions, active learning features and parallelisation. Noting some problems in the literature, this study is conducted with one of the largest WSI colorectal samples dataset with approximately 10,500 WSIs. Of these samples, 900 are testing samples. Furthermore, the robustness of the proposed method is assessed with two additional external datasets (TCGA and PAIP) and a dataset of samples collected directly from the proposed prototype. Our proposed method predicts, for the patch-based tiles, a class based on the severity of the dysplasia and uses that information to classify the whole slide. It is trained with an interpretable mixed-supervision scheme to leverage the domain knowledge introduced by pathologists through spatial annotations. The mixed-supervision scheme allowed for an intelligent sampling strategy effectively evaluated in several different scenarios without compromising the performance. On the internal dataset, the method shows an accuracy of 93.44% and a sensitivity between positive (low-grade and high-grade dysplasia) and non-neoplastic samples of 0.996. Performance on the external test samples varied, with TCGA being the most challenging dataset, with an overall accuracy of 84.91% and a sensitivity of 0.996.

5/2/2024

PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Xiaomin Wu, Rui Xu, Pengchen Wei, Wenkang Qin, Peixiang Huang, Ziheng Li, Lin Luo

Pathological diagnosis remains the definitive standard for identifying tumors. The rise of multimodal large models has simplified the process of integrating image analysis with textual descriptions. Despite this advancement, the substantial costs associated with training and deploying these complex multimodal models, together with a scarcity of high-quality training datasets, create a significant divide between cutting-edge technology and its application in the clinical setting. We have meticulously compiled a dataset of approximately 45,000 cases, covering over 6 different tasks, including the classification of organ tissues, generating pathology report descriptions, and addressing pathology-related questions and answers. We have fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, InternLM, with this dataset to enhance instruction-based performance. We conducted a qualitative assessment of the capabilities of the base model and the fine-tuned model in performing image captioning and classification tasks on the specific dataset. The evaluation results demonstrate that the fine-tuned model exhibits proficiency in addressing typical pathological questions. We hope that by making both our models and datasets publicly available, they can be valuable to the medical and research communities.

8/14/2024

Self-Contrastive Weakly Supervised Learning Framework for Prognostic Prediction Using Whole Slide Images

Saul Fuster, Farbod Khoraminia, Julio Silva-Rodríguez, Umay Kiraz, Geert J. L. H. van Leenders, Trygve Eftestøl, Valery Naranjo, Emiel A. M. Janssen, Tahlita C. M. Zuiverloon, Kjersti Engan

We present a pioneering investigation into the application of deep learning techniques to analyze histopathological images for addressing the substantial challenge of automated prognostic prediction. Prognostic prediction poses a unique challenge as the ground truth labels are inherently weak, and the model must anticipate future events that are not directly observable in the image. To address this challenge, we propose a novel three-part framework comprising a convolutional-network-based tissue segmentation algorithm for region of interest delineation, a contrastive learning module for feature extraction, and a nested multiple instance learning classification module. Our study explores the significance of various regions of interest within the histopathological slides and exploits diverse learning scenarios. The pipeline is initially validated on artificially generated data and a simpler diagnostic task. Transitioning to prognostic prediction, tasks become more challenging. Employing bladder cancer as a use case, our best models yield an AUC of 0.721 and 0.678 for recurrence and treatment outcome prediction respectively.

5/27/2024