Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools

Read original: arXiv:2406.18240 - Published 6/27/2024 by Francisca Silva-Clavería, Carmen Serrano, Iván Matas, Amalia Serrano, Tomás Toledo-Pastrana, David Moreno-Ramírez, Begoña Acha

Overview

This paper investigates the concordance between dermatologists in diagnosing basal cell carcinoma, a common type of skin cancer.
The researchers aimed to build a reliable "ground truth" dataset to train and evaluate artificial intelligence (AI) tools for basal cell carcinoma diagnosis.
They assessed the level of agreement among a group of dermatologists, both in diagnosing basal cell carcinoma and in identifying its dermoscopic criteria, from dermoscopic images.

Plain English Explanation

The paper focuses on a common type of skin cancer called basal cell carcinoma. When doctors examine a suspicious skin lesion, they need to decide whether it is basal cell carcinoma or some other condition. However, even experienced dermatologists may not always agree on the diagnosis.

To help develop AI systems that can accurately detect basal cell carcinoma, the researchers wanted to create a reliable "ground truth" dataset: a collection of skin lesion images that have been carefully labeled by multiple dermatologists. They asked a group of dermatologists to examine a set of dermoscopic images and record their assessments. By analyzing how much the dermatologists agreed or disagreed with each other, the researchers could determine how consistent the labels were and use this information to build a high-quality dataset for training AI models.

Technical Explanation

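According to the paper's abstract, the researchers conducted a single-center, prospective diagnostic study to assess the agreement among four dermatologists on basal cell carcinoma (BCC) dermoscopic criteria and to derive a reference standard from it. The study used 1434 dermoscopic images, taken by a primary health physician, sent via teledermatology, and randomly selected from the teledermatology platform (2019-2021). Of these, 204 were used to test an AI tool and the remainder to train it.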

The dermatologists used a 5-point scale to rate their confidence in each diagnosis. The researchers then analyzed the agreement between the dermatologists using Fleiss' kappa. They also used McNemar's test and Hamming distance to compare the AI tool trained on the ground truth of one dermatologist against the same tool trained on a ground truth statistically inferred from the consensus of four dermatologists.
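The abstract reproduced under Related Papers names three statistics: Fleiss' kappa, McNemar's test, and Hamming distance. The sketch below shows one common way to compute each in plain Python. It is an illustration under stated assumptions, not the authors' code: the function names, the toy ratings, and the choice of the exact (binomial) form of McNemar's test are assumptions made here, and the summary does not say exactly how the authors applied these statistics.

```python
from math import comb


def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put item i
    in category j. Every item must be rated by the same number of raters.
    Undefined (division by zero) if all ratings fall in a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    total = n_items * n_raters
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_item) / n_items          # mean observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


def hamming_distance(a, b):
    """Number of positions at which two equal-length label vectors differ."""
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts:
    b = cases only the first model got right, c = cases only the second did."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): 5 lesions, 4 raters, columns = [BCC, non-BCC].
toy_counts = [[4, 0], [4, 0], [0, 4], [3, 1], [0, 4]]
print("Fleiss' kappa:", round(fleiss_kappa(toy_counts), 3))

# Hypothetical per-lesion pattern vectors (1 = dermoscopic pattern present).
print("Hamming distance:", hamming_distance([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))

# Hypothetical discordant counts between two trained models.
print("McNemar p-value:", mcnemar_exact(1, 9))
```

For real analyses, established implementations such as those in statsmodels are a safer choice than hand-written functions; the versions above are only meant to make the definitions concrete.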

Critical Analysis

The researchers acknowledge several limitations in their study. First, the test set of 204 images may not be large enough to fully capture the diversity of basal cell carcinoma presentations. Additionally, the dermatologists were not provided with any clinical history or additional context about the lesions, which may have affected their diagnoses.

The authors also note that their study only assessed concordance for basal cell carcinoma, and further research is needed to evaluate agreement on other skin conditions. It would be valuable to expand this work to include a broader range of dermatological diagnoses.

Overall, this study provides important insights into the challenges of achieving consistent skin cancer diagnoses, even among expert clinicians. The findings highlight the need for robust training datasets and standardized diagnostic criteria to develop reliable AI-based tools for dermatology.

Conclusion

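This paper investigates the level of agreement among dermatologists in diagnosing basal cell carcinoma, a common type of skin cancer. According to the abstract, the dermatologists agreed very closely on the diagnosis itself (Fleiss' kappa = 0.9079) but showed low agreement in detecting some dermoscopic criteria. This suggests that a reliable "ground truth" for training AI systems to identify BCC patterns should be established from multiple dermatologists rather than from a single one.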

The findings underscore the complexity of skin cancer diagnosis and the need for continued efforts to improve diagnostic consistency, both for human clinicians and AI-based tools. By addressing these challenges, the medical community can work towards more accurate and equitable skin cancer detection and management.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools

Francisca Silva-Clavería, Carmen Serrano, Iván Matas, Amalia Serrano, Tomás Toledo-Pastrana, David Moreno-Ramírez, Begoña Acha

Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar's test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists.

6/27/2024

AI-Driven Skin Cancer Diagnosis: Grad-CAM and Expert Annotations for Enhanced Interpretability

Iván Matas, Carmen Serrano, Francisca Silva, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha

An AI tool has been developed to provide interpretable support for the diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing resource utilization. The interpretability is provided in two ways: on the one hand, the main BCC dermoscopic patterns are found in the image to justify the BCC/Non BCC classification. Secondly, based on the common visual XAI Grad-CAM, a clinically inspired visual explanation is developed where the relevant features for diagnosis are located. Since there is no established ground truth for BCC dermoscopic features, a standard reference is inferred from the diagnosis of four dermatologists using an Expectation Maximization (EM) based algorithm. The results demonstrate significant improvements in classification accuracy and interpretability, positioning this approach as a valuable tool for early BCC detection and referral to dermatologists. The BCC/non-BCC classification achieved an accuracy rate of 90%. For Clinically-inspired XAI results, the detection of BCC patterns useful to clinicians reaches 99% accuracy. As for the Clinically-inspired Visual XAI results, the mean of the Grad-CAM normalized value within the manually segmented clinical features is 0.57, while outside this region it is 0.16. This indicates that the model struggles to accurately identify the regions of the BCC patterns. These results prove the ability of the AI tool to provide a useful explanation.

7/2/2024

Evaluating Machine Learning-based Skin Cancer Diagnosis

Tanish Jain

This study evaluates the reliability of two deep learning models for skin cancer detection, focusing on their explainability and fairness. Using the HAM10000 dataset of dermatoscopic images, the research assesses two convolutional neural network architectures: a MobileNet-based model and a custom CNN model. Both models are evaluated for their ability to classify skin lesions into seven categories and to distinguish between dangerous and benign lesions. Explainability is assessed using Saliency Maps and Integrated Gradients, with results interpreted by a dermatologist. The study finds that both models generally highlight relevant features for most lesion types, although they struggle with certain classes like seborrheic keratoses and vascular lesions. Fairness is evaluated using the Equalized Odds metric across sex and skin tone groups. While both models demonstrate fairness across sex groups, they show significant disparities in false positive and false negative rates between light and dark skin tones. A Calibrated Equalized Odds postprocessing strategy is applied to mitigate these disparities, resulting in improved fairness, particularly in reducing false negative rate differences. The study concludes that while the models show promise in explainability, further development is needed to ensure fairness across different skin tones. These findings underscore the importance of rigorous evaluation of AI models in medical applications, particularly in diverse population groups.

9/9/2024

AI-Enhanced 7-Point Checklist for Melanoma Detection Using Clinical Knowledge Graphs and Data-Driven Quantification

Yuheng Wang, Tianze Yu, Jiayue Cai, Sunil Kalia, Harvey Lui, Z. Jane Wang, Tim K. Lee

The 7-point checklist (7PCL) is widely used in dermoscopy to identify malignant melanoma lesions needing urgent medical attention. It assigns point values to seven attributes: major attributes are worth two points each, and minor ones are worth one point each. A total score of three or higher prompts further evaluation, often including a biopsy. However, a significant limitation of current methods is the uniform weighting of attributes, which leads to imprecision and neglects their interconnections. Previous deep learning studies have treated the prediction of each attribute with the same importance as predicting melanoma, which fails to recognize the clinical significance of the attributes for melanoma. To address these limitations, we introduce a novel diagnostic method that integrates two innovative elements: a Clinical Knowledge-Based Topological Graph (CKTG) and a Gradient Diagnostic Strategy with Data-Driven Weighting Standards (GD-DDW). The CKTG integrates 7PCL attributes with diagnostic information, revealing both internal and external associations. By employing adaptive receptive domains and weighted edges, we establish connections among melanoma's relevant features. Concurrently, GD-DDW emulates dermatologists' diagnostic processes, who first observe the visual characteristics associated with melanoma and then make predictions. Our model uses two imaging modalities for the same lesion, ensuring comprehensive feature acquisition. Our method shows outstanding performance in predicting malignant melanoma and its features, achieving an average AUC value of 85%. This was validated on the EDRA dataset, the largest publicly available dataset for the 7-point checklist algorithm. Specifically, the integrated weighting system can provide clinicians with valuable data-driven benchmarks for their evaluations.

7/25/2024