Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Read original: arXiv:2408.01959 - Published 8/29/2024 by Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe

Overview

Multimodal AI models that can associate images and text have many potential applications, but there are concerns about biases in these models.
This paper examines whether CLIP vision-language models exhibit human-like biases in perceiving facial impressions.
The researchers found that CLIP models do reflect social biases, and the degree of bias is correlated with the size of the dataset used to train the models.

Plain English Explanation

Multimodal AI models that can link images and text have a lot of promise for tasks like automatically describing images or making digital content more accessible for blind and low-vision users. However, there are concerns that these models may pick up on and reflect social biases present in the data they're trained on.

In this study, the researchers examined 43 different CLIP vision-language models to see if they exhibit biases in how they perceive facial impressions, like perceptions of trustworthiness or sexuality. They found that these models do tend to reflect human-like biases, and the degree of bias is related to the size of the dataset used to train the model. Models trained on larger, less curated datasets showed more subtle social biases, like impressions of traits that aren't visually obvious.

The researchers also showed that these facial impression biases can carry over to other AI models, like the Stable Diffusion text-to-image generation model, and that these biases can intersect with racial biases. While these pre-trained CLIP models may be useful for studying bias, the researchers suggest that significant dataset curation would be needed to use them as general-purpose models without these biases.

Technical Explanation

The researchers studied 43 CLIP (Contrastive Language-Image Pre-training) vision-language models to investigate whether they exhibit human-like biases in perceived facial impressions. CLIP models are trained to associate images and text, and have shown impressive performance on a variety of tasks.
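As a rough illustration of what "associating images and text" means in practice, the sketch below scores a face embedding against two opposing text prompts using cosine similarity. The vectors are toy stand-ins for real CLIP embeddings, and the function names, prompts, and difference-of-similarities score are assumptions made for this example, not the paper's confirmed procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def trait_score(image_emb, trait_emb, opposite_emb):
    """Positive when the image is closer to the trait prompt than to its opposite."""
    return cosine(image_emb, trait_emb) - cosine(image_emb, opposite_emb)

# Toy 3-d vectors standing in for real CLIP embeddings (hypothetical values).
face = [0.9, 0.1, 0.2]
trustworthy = [1.0, 0.0, 0.0]    # e.g. embedding of "a photo of a trustworthy person"
untrustworthy = [0.0, 1.0, 0.0]  # e.g. embedding of "a photo of an untrustworthy person"

score = trait_score(face, trustworthy, untrustworthy)
print(f"trait score: {score:.3f}")
```

With a real model, the embeddings would come from its image and text encoders, and per-face scores like this could then be compared against human ratings of the same faces.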

The researchers found that the degree to which a bias is shared across a society (i.e., the degree of cultural consensus) predicted the degree to which that bias was reflected in the CLIP models. Using a hierarchical clustering approach, they also showed that dataset size predicts how closely the underlying structure of facial impression bias in a model resembles that of humans.
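The consensus finding can be pictured as a correlation computed across attributes: for each attribute, one number describes how strongly human raters agree, and another describes how closely a model tracks the average human rating. The sketch below uses Pearson correlation and placeholder numbers. Both are assumptions made for illustration; the values are not results from the paper, and the paper's exact statistic may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder numbers for illustration only; NOT results from the paper.
# consensus: how strongly human raters agree on each attribute (0 to 1).
# model_fit: how closely a model's scores track mean human ratings per attribute.
attributes = ["attr_a", "attr_b", "attr_c", "attr_d"]  # hypothetical attribute names
consensus = [0.9, 0.7, 0.4, 0.2]
model_fit = [0.75, 0.6, 0.35, 0.1]

r = pearson(consensus, model_fit)
print(f"consensus vs. model fit: r = {r:.2f}")
```

A strongly positive r on real data would mean that attributes people agree on most are also the ones the model reproduces most faithfully.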

Interestingly, the researchers found that impressions of visually unobservable attributes, like trustworthiness and sexuality, only emerged in models trained on the largest dataset. This suggests that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases.

Additionally, the researchers showed that Stable Diffusion models, which use CLIP as a text encoder, also learn facial impression biases. These biases were found to intersect with racial biases in the Stable Diffusion XL-Turbo model.

Critical Analysis

The researchers provide important insights into the biases present in multimodal AI models like CLIP. By demonstrating that the degree of bias is correlated with dataset size, the study highlights a tradeoff in training these models on increasingly large, uncurated datasets: a better fit to cultural data also reproduces subtler social biases.

One limitation of the study is that it focuses primarily on facial impression biases, which may not capture the full range of biases present in these models. Additionally, the researchers note that pretrained CLIP models would require significant dataset curation before being used as general-purpose models in a zero-shot setting.

It would be valuable for future research to explore the generalizability of these findings to other multimodal AI models and to investigate strategies for mitigating biases, such as dataset refinement or adversarial debiasing. Additionally, the intersection of racial biases with facial impression biases is an important area for further study.

Conclusion

This study provides valuable insights into the biases present in multimodal AI models like CLIP, which have the potential to be widely used in a variety of applications. The researchers found that these models reflect human-like biases in facial impressions, and that the degree of bias is correlated with the size of the dataset used for training.

While these pre-trained CLIP models may be useful for research on bias, the study suggests that significant dataset curation would be required to use them as general-purpose models without these biases. As multimodal AI continues to advance, it will be crucial to address these biases to ensure the fair and responsible deployment of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe

Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.

8/29/2024

Social perception of faces in a vision-language model

Carina I. Hausladen, Manuel Knott, Colin F. Camerer, Pietro Perona

We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP's social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age, and lighting as much as age. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.

8/27/2024

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Haocheng Dai, Sarang Joshi

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.

5/24/2024

A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models

Ashutosh Sathe, Prachi Jain, Sunayana Sitaram

Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs). In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. Additionally, we find that VLM models exhibit distinct biases across different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.

6/18/2024