On Biases in a UK Biobank-based Retinal Image Classification Model

Read original: arXiv:2408.02676 - Published 8/7/2024 by Anissa Alloula, Rima Mustafa, Daniel R McGowan, Bartłomiej W. Papież

Overview

This paper examines performance disparities in a disease classification model trained on UK Biobank fundus retinal images.
The researchers investigate whether the model's performance differs across population groups, such as those defined by age, sex, ethnicity, and assessment centre.
They then apply existing bias mitigation methods to test whether the model's fairness across these groups can be improved.

Plain English Explanation

The paper focuses on a machine learning model that was trained to analyze retinal images from the UK Biobank database. Retinal images are pictures of the back of the eye, which can provide valuable information about a person's health.

The researchers wanted to see if this model had any biases or unfair tendencies when it came to analyzing images from people of different ages, sexes, or ethnic backgrounds. Biases in AI models can lead to inaccurate or unfair results, so the researchers set out to identify and address any issues.

By examining the model's performance on retinal images from diverse groups of people, the researchers hoped to pinpoint areas where the model might be less accurate or reliable. This could help them improve the model and ensure it works equally well for all patients, regardless of their personal characteristics.

Technical Explanation

The paper describes a machine learning model that was developed using fundus retinal images from the UK Biobank, a large database of health information. The model is a disease classifier: it takes a retinal image as input and predicts whether a disease is present.

The researchers conducted a series of experiments to assess whether the model exhibited biases based on factors like age, sex, and ethnicity. They evaluated the model's performance on retinal images from different demographic groups and looked for discrepancies in accuracy or other metrics.
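The subgroup comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names `subgroup_accuracy` and `accuracy_gap`, the group labels, and the toy predictions are all hypothetical, and accuracy stands in for whichever metric the paper actually reports.

```python
from collections import defaultdict


def subgroup_accuracy(y_true, y_pred, groups):
    """Return {group: accuracy} for each distinct group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(per_group):
    """Difference between the best and the worst subgroup accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Toy, made-up labels for illustration only; not data from the paper.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = subgroup_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group, gap)
```

A large gap alongside a high overall score is the pattern such an analysis looks for: the aggregate metric can hide a subgroup on which the model performs noticeably worse.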

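The experiments revealed substantial differences in performance across population groups, despite strong overall performance of the model. For example, the model tended to be less accurate when analyzing images from older individuals or from certain ethnic minority groups. The authors also report unfair performance for certain assessment centres, which they describe as surprising given the rigorous data standardisation protocol. This suggests the model may not be equally reliable for all patients.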

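The researchers compare how these disparities emerge, with potential causes including imbalances in the training data or inherent biases in the underlying algorithms. They also apply a range of existing bias mitigation methods to each disparity. Each disparity has its own properties and responds differently to the methods, and the methods are largely unable to enhance fairness, which points to a need for mitigation approaches tailored to the specific type of bias.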
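One widely used response to imbalanced training data is to weight each sample inversely to the size of its group, so that small groups contribute as much to the training loss as large ones. The sketch below illustrates that idea under stated assumptions: the summary does not say which mitigation methods the authors used, and `group_balanced_weights` and the toy group labels are hypothetical.

```python
from collections import Counter


def group_balanced_weights(groups):
    """Per-sample weights n / (k * n_g), where n is the number of samples,
    k the number of groups, and n_g the size of the sample's group.
    Each group then carries the same total weight."""
    counts = Counter(groups)
    n = len(groups)
    k = len(counts)
    return [n / (k * counts[g]) for g in groups]


# Toy example: group "A" is three times as common as group "B".
groups = ["A"] * 6 + ["B"] * 2
weights = group_balanced_weights(groups)
print(weights)
```

The weights sum to the number of samples and could be supplied as per-sample weights to a training loss. Since the paper's abstract reports that existing mitigation methods were largely unable to enhance fairness, a scheme like this should be checked per subgroup after training rather than assumed to help.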

Critical Analysis

The paper provides a valuable investigation into the potential biases in a machine learning model for retinal image analysis. The researchers' systematic approach to evaluating the model's performance across different demographic groups is commendable and highlights an important issue in the development of AI systems for healthcare applications.

However, the paper does not delve deeply into the specific causes of the observed biases or provide a comprehensive solution. More research may be needed to fully understand the underlying factors contributing to the biases and to develop robust mitigation strategies.

Additionally, the paper focuses on a single model trained on the UK Biobank data, which may limit the generalizability of the findings. It would be valuable to explore biases in other retinal image analysis models or across multiple datasets to get a more comprehensive understanding of this issue.

Conclusion

This paper sheds light on the critical importance of addressing biases in machine learning models used in healthcare applications. The researchers' investigation of biases in a retinal image classification model highlights the need for thorough, systematic evaluation of AI systems to ensure they perform equitably for all patients.

By identifying and addressing these biases, the medical AI community can work towards developing more trustworthy and fair technologies that can benefit patients from diverse backgrounds. The insights from this paper can inform future research and development efforts in this important area.

Related Papers

On Biases in a UK Biobank-based Retinal Image Classification Model

Anissa Alloula, Rima Mustafa, Daniel R McGowan, Bartłomiej W. Papież

Recent work has uncovered alarming disparities in the performance of machine learning models in healthcare. In this study, we explore whether such disparities are present in the UK Biobank fundus retinal images by training and evaluating a disease classification model on these images. We assess possible disparities across various population groups and find substantial differences despite strong overall performance of the model. In particular, we discover unfair performance for certain assessment centres, which is surprising given the rigorous data standardisation protocol. We compare how these differences emerge and apply a range of existing bias mitigation methods to each one. A key insight is that each disparity has unique properties and responds differently to the mitigation methods. We also find that these methods are largely unable to enhance fairness, highlighting the need for better bias mitigation methods tailored to the specific type of bias.

8/7/2024

Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging

Emma A. M. Stanley, Raissa Souza, Anthony Winder, Vedant Gulve, Kimberly Amador, Matthias Wilms, Nils D. Forkert

Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of disparities in performance between subgroups. Since not all sources of biases in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess how those biases are encoded in models, and how capable bias mitigation methods are at ameliorating performance disparities. In this article, we introduce a novel analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. We developed and tested this framework for conducting controlled in silico trials to assess bias in medical imaging AI using a tool for generating synthetic magnetic resonance images with known disease effects and sources of bias. The feasibility is showcased by using three counterfactual bias scenarios to measure the impact of simulated bias effects on a convolutional neural network (CNN) classifier and the efficacy of three bias mitigation strategies. The analysis revealed that the simulated biases resulted in expected subgroup performance disparities when the CNN was trained on the synthetic datasets. Moreover, reweighing was identified as the most successful bias mitigation strategy for this setup, and we demonstrated how explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. Developing fair AI models is a considerable challenge given that many and often unknown sources of biases can be present in medical imaging datasets. In this work, we present a novel methodology to objectively study the impact of biases and mitigation strategies on deep learning pipelines, which can support the development of clinical AI that is robust and responsible.

7/2/2024

Reducing Biases towards Minoritized Populations in Medical Curricular Content via Artificial Intelligence for Fairer Health Outcomes

Chiman Salavati, Shannon Song, Willmar Sosa Diaz, Scott A. Hale, Roberto E. Montenegro, Fabricio Murai, Shiri Dori-Hacohen

Biased information (recently termed bisinformation) continues to be taught in medical curricula, often long after having been debunked. In this paper, we introduce BRICC, a first-in-class initiative that seeks to mitigate medical bisinformation using machine learning to systematically identify and flag text with potential biases, for subsequent review in an expert-in-the-loop fashion, thus greatly accelerating an otherwise labor-intensive process. A gold-standard BRICC dataset was developed throughout several years, and contains over 12K pages of instructional materials. Medical experts meticulously annotated these documents for bias according to comprehensive coding guidelines, emphasizing gender, sex, age, geography, ethnicity, and race. Using this labeled dataset, we trained, validated, and tested medical bias classifiers. We test three classifier approaches: a binary type-specific classifier, a general bias classifier; an ensemble combining bias type-specific classifiers independently-trained; and a multitask learning (MTL) model tasked with predicting both general and type-specific biases. While MTL led to some improvement on race bias detection in terms of F1-score, it did not outperform binary classifiers trained specifically on each task. On general bias detection, the binary classifier achieves up to 0.923 of AUC, a 27.8% improvement over the baseline. This work lays the foundations for debiasing medical curricula by exploring a novel dataset and evaluating different training model strategies. Hence, it offers new pathways for more nuanced and effective mitigation of bisinformation.

7/18/2024

An investigation into the causes of race bias in AI-based cine CMR segmentation

Tiarna Lee, Esther Puyol-Anton, Bram Ruijsink, Sebastien Roujol, Theodore Barfoot, Shaheim Ogbomo-Harmitt, Miaojing Shi, Andrew P. King

Artificial intelligence (AI) methods are being used increasingly for the automated segmentation of cine cardiac magnetic resonance (CMR) imaging. However, these methods have been shown to be subject to race bias, i.e. they exhibit different levels of performance for different races depending on the (im)balance of the data used to train the AI model. In this paper we investigate the source of this bias, seeking to understand its root cause(s) so that it can be effectively mitigated. We perform a series of classification and segmentation experiments on short-axis cine CMR images acquired from Black and White subjects from the UK Biobank and apply AI interpretability methods to understand the results. In the classification experiments, we found that race can be predicted with high accuracy from the images alone, but less accurately from ground truth segmentations, suggesting that the distributional shift between races, which is often the cause of AI bias, is mostly image-based rather than segmentation-based. The interpretability methods showed that most attention in the classification models was focused on non-heart regions, such as subcutaneous fat. Cropping the images tightly around the heart reduced classification accuracy to around chance level. Similarly, race can be predicted from the latent representations of a biased segmentation model, suggesting that race information is encoded in the model. Cropping images tightly around the heart reduced but did not eliminate segmentation bias. We also investigate the influence of possible confounders on the bias observed.

8/6/2024