Evaluation of Bias Towards Medical Professionals in Large Language Models

Read original: arXiv:2407.12031 - Published 7/18/2024 by Xi Chen, Yang Xu, MingKe You, Li Wang, WeiZhi Liu, Jian Li

Overview

This study examines whether large language models (LLMs) exhibit biases towards medical professionals.
Researchers created fictitious candidate resumes to control for identity factors while maintaining consistent qualifications.
They tested three LLMs (GPT-4, Claude-3-haiku, and Mistral-Large) using a standardized prompt to evaluate resumes for specific medical residency programs.
The study looked at both explicit bias (changing gender and race information) and implicit bias (changing names while hiding race and gender).
The results were compared to real-world data on the demographics of the medical workforce.

Plain English Explanation

The researchers wanted to see if the large language models they tested had any biases when evaluating resumes for medical residency programs. To do this, they created fake resumes with consistent qualifications but changed things like the candidate's gender and race.

They found that all three language models - GPT-4, Claude-3, and Mistral-Large - showed significant biases based on gender and race. For example, the models tended to favor male candidates for surgical and orthopedic roles, while preferring female candidates for fields like dermatology and pediatrics.

The models also exhibited preferences for certain racial groups in different specialties. GPT-4 favored Black and Hispanic candidates in several specialties, while Claude-3 and Mistral-Large generally favored Asian candidates.

When the researchers compared the models' choices to real-world medical workforce data, they found that the language models consistently selected a higher proportion of female and underrepresented racial candidates than those groups' actual share of the field.

This suggests these AI systems could perpetuate biases and undermine diversity in healthcare if used without proper safeguards. The findings highlight the need for thorough bias mitigation strategies when deploying large language models in high-stakes applications such as residency selection.

Technical Explanation

The researchers designed an experiment to test for both explicit and implicit biases in how three large language models (LLMs) - GPT-4, Claude-3-haiku, and Mistral-Large - evaluated fictional medical residency candidates.

They evaluated 900,000 simulated resumes that held qualifications constant while varying the candidates' gender and race. This allowed them to isolate the effect of these demographic factors on the LLMs' assessments.

To test for explicit bias, the researchers directly modified the gender and race information on the resumes. For implicit bias, they changed the names of the candidates while obscuring their race and gender.
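
The two conditions can be illustrated with a minimal Python sketch. It is not the authors' code: the resume fields, the group labels, and the placeholder names are all assumptions made for illustration, and the real study's resume template is not described in this summary.

```python
from itertools import product

# Hypothetical qualifications block, identical for every candidate.
QUALIFICATIONS = (
    "Medical school: accredited program\n"
    "Licensing exams: passed\n"
    "Publications: 2\n"
)


def explicit_variants(genders, races):
    """Explicit condition: gender and race are printed on the resume."""
    return [
        {
            "gender": gender,
            "race": race,
            "text": f"Gender: {gender}\nRace: {race}\n{QUALIFICATIONS}",
        }
        for gender, race in product(genders, races)
    ]


def implicit_variants(names_by_group):
    """Implicit condition: only a name is printed; gender and race are hidden.

    names_by_group maps (gender, race) to a list of names assumed to signal
    that group. The group labels are kept as metadata for later analysis.
    """
    variants = []
    for (gender, race), names in names_by_group.items():
        for name in names:
            variants.append(
                {
                    "gender": gender,
                    "race": race,
                    "text": f"Name: {name}\n{QUALIFICATIONS}",
                }
            )
    return variants
```

Because every variant shares the same qualifications text, any systematic difference in how a model treats the variants can be attributed to the demographic signal alone.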

The LLMs were then prompted to evaluate the resumes for specific medical specialties, such as surgery, pediatrics, and psychiatry. The model outputs were analyzed to identify any patterns of gender or racial preferences.
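
A standardized prompt of this kind might look like the sketch below. The prompt wording, the choose-one-candidate format, and the `ask_model` callable are all hypothetical; the summary does not reproduce the study's actual prompt, and no real vendor API is called here.

```python
import re
from typing import Callable, Optional, Sequence


def build_prompt(specialty: str, resumes: Sequence[str]) -> str:
    """Build the same prompt structure for every trial (wording is illustrative)."""
    numbered = "\n\n".join(
        f"Candidate {i}:\n{text}" for i, text in enumerate(resumes, start=1)
    )
    return (
        f"You are reviewing applications for a {specialty} residency program.\n"
        "Choose the single best candidate and answer with the candidate number only.\n\n"
        f"{numbered}"
    )


def pick_candidate(
    ask_model: Callable[[str], str], specialty: str, resumes: Sequence[str]
) -> Optional[int]:
    """Return the zero-based index of the chosen resume, or None if unparseable.

    ask_model is any function that sends a prompt to an LLM and returns its
    reply as text; it is injected so the sketch stays vendor-neutral.
    """
    reply = ask_model(build_prompt(specialty, resumes))
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    choice = int(match.group())
    return choice - 1 if 1 <= choice <= len(resumes) else None
```

Passing the model in as a plain function keeps the prompt fixed across GPT-4, Claude-3-haiku, and Mistral-Large, so differences in the tallies reflect the models rather than the wording.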

The results showed all three LLMs exhibited significant biases, favoring certain demographic groups over others across different medical fields. For example, the models tended to prefer male candidates for surgical and orthopedic roles, while selecting more female candidates for dermatology, family medicine, OB/GYN, pediatrics, and psychiatry.
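
One simple way to check for a gender preference in each specialty is a chi-square goodness-of-fit test against an even split. The sketch below uses that test as an assumption; the summary does not say which statistical test the authors applied, and the record format is hypothetical.

```python
import math
from collections import Counter, defaultdict


def gender_preference(records):
    """Summarise gender selections per specialty.

    records: iterable of (specialty, chosen_gender) pairs, where
    chosen_gender is "Male" or "Female".
    """
    counts = defaultdict(Counter)
    for specialty, gender in records:
        counts[specialty][gender] += 1

    results = {}
    for specialty, tally in counts.items():
        male, female = tally["Male"], tally["Female"]
        total = male + female
        expected = total / 2  # null hypothesis: no gender preference
        chi2 = ((male - expected) ** 2 + (female - expected) ** 2) / expected
        # Upper-tail probability of a chi-square variable with 1 degree of freedom.
        p_value = math.erfc(math.sqrt(chi2 / 2))
        results[specialty] = {
            "female_share": female / total,
            "chi2": chi2,
            "p_value": p_value,
        }
    return results
```

A female share well below 0.5 with a small p-value would correspond to the male preference described for surgery and orthopedics, and a share well above 0.5 to the female preference described for the other specialties.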

The models also displayed racial biases, with Claude-3 and Mistral-Large generally favoring Asian candidates, while GPT-4 showed a preference for Black and Hispanic candidates in some specialties.

When compared to real-world data from the Association of American Medical Colleges, the language models consistently selected higher proportions of female and underrepresented racial candidates than their actual representation in the medical workforce.
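
The comparison with workforce data can be sketched as a one-sample proportion test. This is an illustrative approach rather than the authors' documented method, and the benchmark share passed in would have to come from the real Association of American Medical Colleges figures, which are not given in this summary.

```python
import math


def compare_to_benchmark(selected, total, benchmark_share):
    """Compare a model's selection share for one group with a workforce benchmark.

    selected: number of times the model chose a candidate from the group.
    total: number of selections the model made overall.
    benchmark_share: the group's share of the real workforce (between 0 and 1).
    """
    observed = selected / total
    # Standard error of the proportion under the benchmark share.
    std_err = math.sqrt(benchmark_share * (1 - benchmark_share) / total)
    z = (observed - benchmark_share) / std_err
    # Two-sided p-value from the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "observed_share": observed,
        "gap": observed - benchmark_share,
        "z": z,
        "p_value": p_value,
    }
```

A positive gap means the model selected the group more often than its share of the workforce, which is the direction the study reports for female and underrepresented racial candidates.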

Critical Analysis

The researchers acknowledge several limitations to their study. While the simulated resumes were designed to control for qualifications, there may have been other subtle differences that influenced the LLMs' assessments.

Additionally, the study only tested three specific language models, so the findings may not generalize to all LLMs or other types of AI systems used in clinical decision support. Further research is needed to understand how a broader range of models might exhibit biases.

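The analysis also concentrates on explicit and implicit biases taken separately. The reported results do include some combined gender-and-race preferences (for example, towards Hispanic female and Asian male candidates in various specialties), but intersectional biences, meaning the compounded effects of multiple demographic factors, deserve a more systematic investigation that could yield additional insights.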

Overall, this study provides valuable evidence of the potential for large language models to perpetuate gender and racial biases in high-stakes medical decision-making. The findings underscore the importance of proactively addressing bias issues before deploying these systems in real-world healthcare settings.

Conclusion

This study reveals that prominent large language models, including GPT-4, Claude-3, and Mistral-Large, exhibit significant gender and racial biases when evaluating medical residency candidates. The models consistently favored certain demographic groups over others across different medical specialties, and their selections diverged from the actual demographic makeup of the healthcare workforce.

These findings highlight the risks of using language models for clinical decision support without robust bias mitigation strategies. If deployed without proper safeguards, such AI systems could undermine efforts to improve diversity and equity in the medical profession.

The researchers emphasize the need for continued scrutiny and development of techniques to identify and mitigate biases in large language models before they are used in high-stakes applications that impact people's lives. Ongoing vigilance and a commitment to ethical AI development will be crucial to ensure these powerful technologies benefit society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluation of Bias Towards Medical Professionals in Large Language Models

Xi Chen, Yang Xu, MingKe You, Li Wang, WeiZhi Liu, Jian Li

This study evaluates whether large language models (LLMs) exhibit biases towards medical professionals. Fictitious candidate resumes were created to control for identity factors while maintaining consistent qualifications. Three LLMs (GPT-4, Claude-3-haiku, and Mistral-Large) were tested using a standardized prompt to evaluate resumes for specific residency programs. Explicit bias was tested by changing gender and race information, while implicit bias was tested by changing names while hiding race and gender. Physician data from the Association of American Medical Colleges was used to compare with real-world demographics. 900,000 resumes were evaluated. All LLMs exhibited significant gender and racial biases across medical specialties. Gender preferences varied, favoring male candidates in surgery and orthopedics, while preferring females in dermatology, family medicine, obstetrics and gynecology, pediatrics, and psychiatry. Claude-3 and Mistral-Large generally favored Asian candidates, while GPT-4 preferred Black and Hispanic candidates in several specialties. Tests revealed strong preferences towards Hispanic females and Asian males in various specialties. Compared to real-world data, LLMs consistently chose higher proportions of female and underrepresented racial candidates than their actual representation in the medical workforce. GPT-4, Claude-3, and Mistral-Large showed significant gender and racial biases when evaluating medical professionals for residency selection. These findings highlight the potential for LLMs to perpetuate biases and compromise healthcare workforce diversity if used without proper bias mitigation strategies.

7/18/2024

Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Raphael Poulain, Hamed Fayyaz, Rahmatollah Beheshti

Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns such as larger models not being necessarily less biased and fine-tuned models on medical data not being necessarily better than the general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns and shows that specific phrasing can influence bias patterns and reflection-type approaches (like Chain of Thought) can reduce biased outcomes effectively. Consistent with prior studies, we call on additional evaluations, scrutiny, and enhancement of LLMs used in clinical decision support applications.

4/24/2024

The Silicone Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Lena Armstrong, Abbey Liu, Stephen MacNeil, Danae Metaxa

Large language models (LLMs) are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. To do so, we conduct an algorithm audit of race and gender biases in one commonly-used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes with 32 different names (4 names for each combination of the 2 gender and 4 racial groups) and two anonymous options across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases; women's resumes had occupations with less experience, while Asian and Hispanic resumes had immigrant markers, such as non-native English and non-U.S. education and work experiences. Our findings contribute to a growing body of literature on LLM biases, in particular when used in workplace contexts.

5/13/2024

You Gotta be a Doctor, Lin: An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

Huy Nghiem, John Prindle, Jieyu Zhao, Hal Daumé III

Social science research has shown that candidates with names indicative of certain races or genders often face discrimination in employment practices. Similarly, Large Language Models (LLMs) have demonstrated racial and gender biases in various applications. In this study, we utilize GPT-3.5-Turbo and Llama 3-70B-Instruct to simulate hiring decisions and salary recommendations for candidates with 320 first names that strongly signal their race and gender, across over 750,000 prompts. Our empirical results indicate a preference among these models for hiring candidates with White female-sounding names over other demographic groups across 40 occupations. Additionally, even among candidates with identical qualifications, salary recommendations vary by as much as 5% between different subgroups. A comparison with real-world labor data reveals inconsistent alignment with U.S. labor market characteristics, underscoring the necessity of risk investigation of LLM-powered systems.

6/19/2024