Evaluating Explanatory Capabilities of Machine Learning Models in Medical Diagnostics: A Human-in-the-Loop Approach






Published 4/1/2024 by José Bobes-Bascarán (University of Coruña), Eduardo Mosqueira-Rey (University of Coruña), Ángel Fernández-Leal (University of Coruña), Elena Hernández-Pereira (University of Coruña), David Alonso-Ríos (University of Coruña), Vicente Moret-Bonillo (University of Coruña), Israel Figueirido-Arnoso (University of Coruña), Yolanda Vidal-Ínsua (Complejo Hospitalario)



This paper presents a comprehensive study on the evaluation of explanatory capabilities of machine learning models, with a focus on Decision Trees, Random Forest and XGBoost models using a pancreatic cancer dataset. We use Human-in-the-Loop related techniques and medical guidelines as a source of domain knowledge to establish the importance of the different features that are relevant to establish a pancreatic cancer treatment. These features are not only used as a dimensionality reduction approach for the machine learning models, but also as a way to evaluate the explainability capabilities of the different models using agnostic and non-agnostic explainability techniques. To facilitate interpretation of explanatory results, we propose the use of similarity measures such as the Weighted Jaccard Similarity coefficient. The goal is to not only select the best performing model but also the one that can best explain its conclusions and aligns with human domain knowledge.



Introduction

This section introduces Explainable AI (XAI) and discusses its importance in making AI systems more understandable to humans. It highlights the advantages of XAI, such as fostering confidence in model predictions, promoting responsible AI development, aiding in debugging, and allowing for auditing of AI models.

The section describes the evolution of AI systems from symbolic AI models, which were inherently explainable, to machine learning models that can learn from data but often lack interpretability. It explains the trade-off between interpretability and performance: more interpretable models tend to be less accurate than opaque ones.

The authors mention the development of "agnostic" techniques to include explainability in opaque models but point out their limitations in terms of reliability and computational efficiency. They emphasize the need to validate the explanatory capabilities of these models, especially in complex domains like medical environments.

The paper focuses on solving a problem related to pancreatic cancer treatment using classical machine learning models such as Decision Trees, Random Forest, and XGBoost. The authors aim to validate the explainability of these models by comparing different explainability methods and assessing their medical significance through collaboration with human experts and the use of medical guidelines.

The contribution of the paper lies in the development of different ways to evaluate explainability models based on human expert collaboration and domain-specific guidelines, applied to the pancreatic cancer treatment problem.

State of the art

This section discusses explainability in AI, focusing on its importance in the healthcare domain, particularly for pancreatic cancer diagnosis and treatment. Key points:

  • Explainable AI techniques help humans understand, trust, and manage machine learning systems. Explainability goes beyond interpretability.
  • Explainability methods can be classified by scope (local or global) and generality (model-agnostic or model-specific).
  • Building trust is crucial when creating AI systems for collaboration with humans, especially in critical domains like healthcare.
  • Examples of explainability techniques applied to medical ML models are provided, with SHAP and Grad-CAM being the most popular.
  • Problems with explainability methods are discussed, including inconsistency, instability, computational inefficiency, and the need for domain experts to evaluate explanations.
  • Pancreatic cancer staging is explained, using the TNM system and classification into stages 0, I, II, III, and IV.
  • The NCCN Clinical Practice Guidelines for Pancreatic Adenocarcinoma are introduced, organized into processes (PANCs) for different stages of diagnosis and treatment.
  • The guidelines establish four main diagnostic groups: resectable, borderline resectable, locally advanced, and metastatic disease, which guide treatment decisions.

Data and methodology

The dataset used in this work was obtained from The Cancer Genome Atlas Program and consists of 181 pancreatic cancer cases, of which 117 received chemotherapy treatment. Each case has 158 features ranging from family history data to treatment and follow-up information.

A meticulous feature selection process was performed, involving data pruning, eliminating redundant information, and asking a panel of oncologists to select relevant features for treatment selection. This resulted in 27 features and the target therapy type variable.

Three different feature sets were created:

  1. Recommended set: 14 features rated as highly relevant or relevant by medical experts
  2. Maximum set: All 27 features, including those considered barely relevant
  3. Minimum set: 5 features (Age, T, N, M, Stage) achieved through dimensionality reduction

The medical guidelines from "Pancreatic Adenocarcinoma - NCCN Clinical Practice Guidelines in Oncology" were analyzed to evaluate feature importance regarding chemotherapy treatment. A simplified graph of the diagnostic decisions involving chemotherapy was created.

Three types of machine learning models were chosen for comparison: Decision Tree, Random Forest, and XGBoost. Their advantages and disadvantages in terms of accuracy and understandability are discussed.

To enhance the interpretability of the ML models, explainability techniques were employed:

  • Model-specific methods: Mean Decrease in Impurity (MDI) and Mean Decrease in Accuracy (MDA) for tree-based models
  • Model-agnostic methods: SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME)
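
The two tree-oriented importance measures listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`gini`, `impurity_decrease`, `permutation_importance`), the toy data and the use of accuracy as the score are assumptions made for illustration. MDI is shown only for a single split using Gini impurity; MDA is shown as the mean accuracy drop after shuffling one feature column.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Row = List[float]


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent: Sequence[int],
                      left: Sequence[int],
                      right: Sequence[int]) -> float:
    """Impurity removed by one split of `parent` into `left` and `right`.

    MDI accumulates such decreases (weighted by the share of samples
    reaching the node) over every split that uses a given feature.
    """
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


def accuracy(predict: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[Row], int],
                           X: List[Row],
                           y: List[int],
                           n_repeats: int = 10,
                           seed: int = 0) -> List[float]:
    """MDA-style importance: mean accuracy drop when one column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): the label depends only on feature 0.
X_toy = [[i % 2, (i * 7) % 5] for i in range(20)]
y_toy = [row[0] for row in X_toy]
print(permutation_importance(lambda row: row[0], X_toy, y_toy))
```

In the toy run, shuffling feature 1 cannot change any prediction, so its importance is zero, while shuffling feature 0 degrades accuracy. In practice these quantities would normally come from a library, for example scikit-learn's `feature_importances_` attribute on tree ensembles and `sklearn.inspection.permutation_importance`; this summary does not state which tooling the paper used.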


Results

The results of the different machine learning models are presented in terms of accuracy and interpretability, using the three feature sets described above (section 3.2 of the paper). The models are compared with medical guidelines and expert criteria.

In terms of accuracy, the models built with the minimum set of features performed best, suggesting that adding more features does not always improve accuracy and can make models more complex. Decision trees were generally more accurate, with results close to XGBoost. The small training dataset size may explain why the more complex XGBoost did not outperform simpler models.

For interpretability, various explainable AI methods were used to extract feature importance and compare it to medical guidelines and expert opinion. With the minimum feature set, pathologic stage, age and pathologic N were consistently identified as the most important features across methods.

Weighted Jaccard similarity was used to quantitatively compare feature importance rankings between XAI methods, guidelines and experts. Similarity was high between guidelines and experts. For the recommended feature set, decision trees considered fewer features than experts and guidelines, while XGBoost spread importance more evenly, leading to higher similarity with expert opinion.
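
As a rough illustration of the comparison described above, the weighted Jaccard similarity of two non-negative importance vectors x and y is commonly defined as the sum of the element-wise minima divided by the sum of the element-wise maxima. The sketch below assumes that definition; the function name and the importance values are hypothetical and are not taken from the paper.

```python
from typing import Mapping


def weighted_jaccard(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Weighted Jaccard similarity: sum(min(a_i, b_i)) / sum(max(a_i, b_i)).

    A feature missing from one mapping is treated as having weight 0 there.
    """
    if any(w < 0 for w in list(a.values()) + list(b.values())):
        raise ValueError("weights must be non-negative")
    features = set(a) | set(b)
    numerator = sum(min(a.get(f, 0.0), b.get(f, 0.0)) for f in features)
    denominator = sum(max(a.get(f, 0.0), b.get(f, 0.0)) for f in features)
    # Two all-zero vectors carry no disagreement; treat them as identical.
    return numerator / denominator if denominator > 0 else 1.0


# Made-up importance weights, for illustration only.
expert = {"Stage": 0.5, "Age": 0.25, "N": 0.25}
model = {"Stage": 0.5, "Age": 0.5}
print(round(weighted_jaccard(expert, model), 3))  # 0.75 / 1.25 -> 0.6
```

A value of 1 means the two sources assign identical weights to every feature, and 0 means they share no weighted feature. This is consistent with the observation above that a model concentrating importance on fewer features than the experts scores lower, since every feature it ignores adds to the denominator only.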

Overall, the results show that accurate and interpretable models can be built with a small set of key clinical features. The interpretations produced by XAI methods generally align with expert knowledge, though some differences exist. Simpler models like decision trees may be preferable for easier interpretability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Enhancing Deep Learning Model Explainability in Brain Tumor Datasets using Post-Heuristic Approaches

Konstantinos Pasvantis, Eftychios Protopapadakis





The application of deep learning models in medical diagnosis has showcased considerable efficacy in recent years. Nevertheless, a notable limitation involves the inherent lack of explainability during decision-making processes. This study addresses such a constraint, by enhancing the interpretability robustness. The primary focus is directed towards refining the explanations generated by the LIME Library and LIME image explainer. This is achieved through post-processing mechanisms, based on scenario-specific rules. Multiple experiments have been conducted using publicly accessible datasets related to brain tumor detection. Our proposed post-heuristic approach demonstrates significant advancements, yielding more robust and concrete results, in the context of medical diagnosis.




Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI

Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia





Breast cancer has rapidly increased in prevalence in recent years, making it one of the leading causes of mortality worldwide. Among all cancers, it is by far the most common. Diagnosing this illness manually requires significant time and expertise. Since detecting breast cancer is a time-consuming process, preventing its further spread can be aided by creating machine-based forecasts. Machine learning and Explainable AI are crucial in classification as they not only provide accurate predictions but also offer insights into how the model arrives at its decisions, aiding in the understanding and trustworthiness of the classification results. In this study, we evaluate and compare the classification accuracy, precision, recall, and F-1 scores of five different machine learning methods using a primary dataset (500 patients from Dhaka Medical College Hospital). Five different supervised machine learning techniques, including decision tree, random forest, logistic regression, naive bayes, and XGBoost, have been used to achieve optimal results on our dataset. Additionally, this study applied SHAP analysis to the XGBoost model to interpret the model's predictions and understand the impact of each feature on the model's output. We compared the accuracy with which several algorithms classified the data, as well as contrasted with other literature in this field. After final evaluation, this study found that XGBoost achieved the best model accuracy, which is 97%.



Explainable AI Integrated Feature Engineering for Wildfire Prediction


Di Fan, Ayan Biswas, James Paul Ahrens





Wildfires present intricate challenges for prediction, necessitating the use of sophisticated machine learning techniques for effective modeling. In our research, we conducted a thorough assessment of various machine learning algorithms for both classification and regression tasks relevant to predicting wildfires. We found that for classifying different types or stages of wildfires, the XGBoost model outperformed others in terms of accuracy and robustness. Meanwhile, the Random Forest regression model showed superior results in predicting the extent of wildfire-affected areas, excelling in both prediction error and explained variance. Additionally, we developed a hybrid neural network model that integrates numerical data and image information for simultaneous classification and regression. To gain deeper insights into the decision-making processes of these models and identify key contributing features, we utilized eXplainable Artificial Intelligence (XAI) techniques, including TreeSHAP, LIME, Partial Dependence Plots (PDP), and Gradient-weighted Class Activation Mapping (Grad-CAM). These interpretability tools shed light on the significance and interplay of various features, highlighting the complex factors influencing wildfire predictions. Our study not only demonstrates the effectiveness of specific machine learning models in wildfire-related tasks but also underscores the critical role of model transparency and interpretability in environmental science applications.



Evaluating the Explainability of Attributes and Prototypes for a Medical Classification Model


Luisa Gallée, Catharina Silvia Lisson, Christoph Gerhard Lisson, Daniela Drees, Felix Weig, Daniel Vogele, Meinrad Beer, Michael Gotz





Due to the sensitive nature of medicine, it is particularly important and highly demanded that AI methods are explainable. This need has been recognised and there is great research interest in xAI solutions with medical applications. However, there is a lack of user-centred evaluation regarding the actual impact of the explanations. We evaluate attribute- and prototype-based explanations with the Proto-Caps model. This xAI model reasons the target classification with human-defined visual features of the target object in the form of scores and attribute-specific prototypes. The model thus provides a multimodal explanation that is intuitively understandable to humans thanks to predefined attributes. A user study involving six radiologists shows that the explanations are subjectively perceived as helpful, as they reflect their decision-making process. The results of the model are considered a second opinion that radiologists can discuss using the model's explanations. However, it was shown that the inclusion and increased magnitude of model explanations objectively can increase confidence in the model's predictions when the model is incorrect. We can conclude that attribute scores and visual prototypes enhance confidence in the model. However, additional development and repeated user studies are needed to tailor the explanation to the respective use case.

