Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

  • Large language models (LLMs) can explicitly endorse egalitarian beliefs but still harbor subtle, implicit biases, similar to humans.
  • Measuring such implicit biases in LLMs can be challenging as they become increasingly proprietary and their internal representations may not be accessible.
  • Additionally, implicit biases are primarily a concern if they affect the actual decisions these systems make.

Plain English Explanation

Large language models (LLMs) are advanced AI systems that can understand and generate human-like text. While these models may explicitly express beliefs that support equality and fairness, they can still harbor subtle, unconscious biases, much like humans do. Measuring Implicit Bias in Large Language Models addresses two key challenges in assessing these implicit biases:

  1. Accessibility: As LLMs become more proprietary, it may not be possible to access their internal representations (known as "embeddings") and apply existing bias measurement techniques.
  2. Relevance: Implicit biases are only concerning if they actually influence the decisions and behaviors of these systems, not just their language.

To tackle these challenges, the researchers introduce two new measures:

  1. LLM Implicit Bias: A prompt-based approach inspired by the Implicit Association Test, which is widely used to study automatic associations in human minds.
  2. LLM Decision Bias: A strategy to detect subtle discrimination in the decision-making of these language models.

These new measures draw from psychological research and aim to reveal nuanced biases in LLMs that may not be detected by standard benchmarks.

Technical Explanation

The researchers introduce two new measures to assess implicit biases in large language models (LLMs):

  1. LLM Implicit Bias: This measure adapts the Implicit Association Test (IAT), a widely used psychological tool for studying automatic associations in human minds. The researchers developed prompt-based tasks that assess the strength of associations between concepts (e.g., race and weapons) in LLMs.

  2. LLM Decision Bias: This measure is based on psychological research indicating that relative evaluations between two candidates, rather than absolute evaluations of each, are more diagnostic of implicit biases. The researchers designed decision-making tasks to detect subtle discrimination in LLM behaviors.

Using these new measures, the researchers found pervasive stereotype biases in 8 value-aligned LLMs across 4 social categories (race, gender, religion, health) and 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). The prompt-based LLM Implicit Bias measure correlated with existing language model embedding-based bias methods, but better predicted the downstream behaviors measured by LLM Decision Bias.

Critical Analysis

The researchers acknowledge that as LLMs become increasingly proprietary, accessing their internal representations (embeddings) to apply existing bias measurement techniques may not be possible. The new measures they introduce, LLM Implicit Bias and LLM Decision Bias, address this challenge by relying on prompt-based approaches that only require observing the model's outputs.

However, one potential limitation of the LLM Decision Bias measure is that it may not capture the full range of biases that could influence an LLM's decision-making. The researchers focus on relative evaluations between candidates, but there may be other ways in which implicit biases could manifest in the decision-making process.

Additionally, while the researchers found pervasive stereotype biases in the LLMs they tested, it's important to note that the specific biases observed may be influenced by the training data and objectives used to develop these models. Beyond Performance: Quantifying and Mitigating Label Bias in NLP Models and Subtle Biases Need Subtler Measures: Dual Metrics for Application-Aligned Fairness discuss the importance of considering the intended application and context when assessing model biases.


This research introduces two novel measures, LLM Implicit Bias and LLM Decision Bias, to address the challenges of assessing implicit biases in large language models as they become increasingly proprietary. By drawing on psychological research, these measures aim to reveal nuanced biases that may not be detected by standard benchmarks.

The findings suggest that even value-aligned LLMs can harbor pervasive stereotype biases mirroring those in society. This highlights the importance of developing more comprehensive and context-specific methods for evaluating and mitigating biases in these powerful AI systems, as their decisions and behaviors can have significant real-world impacts. Reinforcement Learning from Reflection through Debates and Bias Patterns in Application of LLMs to Clinical Decision Support suggest potential avenues for further research in this area.

