Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval

Read original: arXiv:2407.20371 - Published 8/22/2024 by Kyra Wilson, Aylin Caliskan

Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval

Overview

This paper examines the potential for gender, race, and intersectional bias in resume screening using language models.
The researchers investigate how language models trained on job application data may exhibit biases that influence hiring decisions.
They propose a framework to audit language models for biases and test it on a large language model used for resume screening.

Plain English Explanation

The paper looks at whether language models used in resume screening can be biased against certain gender or racial groups. Language models are AI systems trained on large datasets of text, which can then be used to analyze new text, like resumes. The researchers develop a framework to test these models for biases, and apply it to a language model used for resume screening. Their findings suggest that these models may discriminate against certain groups in hiring decisions.

Technical Explanation

The paper proposes a framework to audit language models for gender, race, and intersectional biases in the context of resume screening. They evaluate a large language model used for resume retrieval and ranking, testing it on a dataset of synthetic resumes with varying gender and race attributes.

The framework involves several key steps:

Generating a set of synthetic resumes with controlled gender and race attributes.
Using the language model to retrieve and rank the resumes in response to job postings.
Analyzing the ranked lists to measure differences in model behavior based on the gender and race attributes.

The researchers find evidence of gender and racial biases in the language model's resume rankings, with certain demographic groups receiving systematically lower rankings for the same qualifications. They also observe interactions between gender and race, indicating the presence of intersectional biases.

Critical Analysis

The paper raises important concerns about the potential for algorithmic bias in hiring processes that rely on language models. While the authors acknowledge limitations in their synthetic dataset and the need for further research, their findings suggest that companies should carefully audit any AI systems used in recruiting and make concerted efforts to mitigate biases.

One potential limitation is the use of a single language model, as results may vary across different models and architectures. Additionally, the paper does not explore potential sources of bias or ways to address them. Further research could investigate techniques to debias language models or alternative approaches to resume screening that are less susceptible to demographic biases.

Conclusion

This paper highlights the need for increased scrutiny and accountability around the use of language models in high-stakes decision-making processes like hiring. The authors demonstrate a framework for identifying biases in these models, which can inform efforts to develop fairer and more inclusive AI systems for resume screening and other applications. As the use of AI in hiring grows, it is crucial that the technology is carefully audited and designed to mitigate biases that could perpetuate discrimination in the job market.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Gender, Race, and Intersectional Bias in Resume Screening via Language Model Retrieval

Kyra Wilson, Aylin Caliskan

Artificial intelligence (AI) hiring tools have revolutionized resume screening, and large language models (LLMs) have the potential to do the same. However, given the biases which are embedded within LLMs, it is unclear whether they can be used in this scenario without disadvantaging groups based on their protected attributes. In this work, we investigate the possibilities of using LLMs in a resume screening setting via a document retrieval framework that simulates job candidate selection. Using that framework, we then perform a resume audit study to determine whether a selection of Massive Text Embedding (MTE) models are biased in resume screening scenarios. We simulate this for nine occupations, using a collection of over 500 publicly available resumes and 500 job descriptions. We find that the MTEs are biased, significantly favoring White-associated names in 85.1% of cases and female-associated names in only 11.1% of cases, with a minority of cases showing no statistically significant differences. Further analyses show that Black males are disadvantaged in up to 100% of cases, replicating real-world patterns of bias in employment settings, and validate three hypotheses of intersectionality. We also find an impact of document length as well as the corpus frequency of names in the selection of resumes. These findings have implications for widely used AI tools that are automating employment, fairness, and tech policy.

8/22/2024

The Silicone Ceiling: Auditing GPT's Race and Gender Biases in Hiring

Lena Armstrong, Abbey Liu, Stephen MacNeil, Danae Metaxa

Large language models (LLMs) are increasingly being introduced in workplace settings, with the goals of improving efficiency and fairness. However, concerns have arisen regarding these models' potential to reflect or exacerbate social biases and stereotypes. This study explores the potential impact of LLMs on hiring practices. To do so, we conduct an algorithm audit of race and gender biases in one commonly-used LLM, OpenAI's GPT-3.5, taking inspiration from the history of traditional offline resume audits. We conduct two studies using names with varied race and gender connotations: resume assessment (Study 1) and resume generation (Study 2). In Study 1, we ask GPT to score resumes with 32 different names (4 names for each combination of the 2 gender and 4 racial groups) and two anonymous options across 10 occupations and 3 evaluation tasks (overall rating, willingness to interview, and hireability). We find that the model reflects some biases based on stereotypes. In Study 2, we prompt GPT to create resumes (10 for each name) for fictitious job candidates. When generating resumes, GPT reveals underlying biases; women's resumes had occupations with less experience, while Asian and Hispanic resumes had immigrant markers, such as non-native English and non-U.S. education and work experiences. Our findings contribute to a growing body of literature on LLM biases, in particular when used in workplace contexts.

5/13/2024

Do Large Language Models Discriminate in Hiring Decisions on the Basis of Race, Ethnicity, and Gender?

Haozhe An, Christabel Acquaye, Colin Wang, Zongxia Li, Rachel Rudinger

We examine whether large language models (LLMs) exhibit race- and gender-based name discrimination in hiring decisions, similar to classic findings in the social sciences (Bertrand and Mullainathan, 2004). We design a series of templatic prompts to LLMs to write an email to a named job applicant informing them of a hiring decision. By manipulating the applicant's first name, we measure the effect of perceived race, ethnicity, and gender on the probability that the LLM generates an acceptance or rejection email. We find that the hiring decisions of LLMs in many settings are more likely to favor White applicants over Hispanic applicants. In aggregate, the groups with the highest and lowest acceptance rates respectively are masculine White names and masculine Hispanic names. However, the comparative acceptance rates by group vary under different templatic settings, suggesting that LLMs' race- and gender-sensitivity may be idiosyncratic and prompt-sensitive.

6/18/2024

💬

Evaluation of Bias Towards Medical Professionals in Large Language Models

Xi Chen, Yang Xu, MingKe You, Li Wang, WeiZhi Liu, Jian Li

This study evaluates whether large language models (LLMs) exhibit biases towards medical professionals. Fictitious candidate resumes were created to control for identity factors while maintaining consistent qualifications. Three LLMs (GPT-4, Claude-3-haiku, and Mistral-Large) were tested using a standardized prompt to evaluate resumes for specific residency programs. Explicit bias was tested by changing gender and race information, while implicit bias was tested by changing names while hiding race and gender. Physician data from the Association of American Medical Colleges was used to compare with real-world demographics. 900,000 resumes were evaluated. All LLMs exhibited significant gender and racial biases across medical specialties. Gender preferences varied, favoring male candidates in surgery and orthopedics, while preferring females in dermatology, family medicine, obstetrics and gynecology, pediatrics, and psychiatry. Claude-3 and Mistral-Large generally favored Asian candidates, while GPT-4 preferred Black and Hispanic candidates in several specialties. Tests revealed strong preferences towards Hispanic females and Asian males in various specialties. Compared to real-world data, LLMs consistently chose higher proportions of female and underrepresented racial candidates than their actual representation in the medical workforce. GPT-4, Claude-3, and Mistral-Large showed significant gender and racial biases when evaluating medical professionals for residency selection. These findings highlight the potential for LLMs to perpetuate biases and compromise healthcare workforce diversity if used without proper bias mitigation strategies.

7/18/2024