Social perception of faces in a vision-language model

Read original: arXiv:2408.14435 - Published 8/27/2024 by Carina I. Hausladen, Manuel Knott, Colin F. Camerer, Pietro Perona

Overview

This paper investigates how CLIP, a widely used open-source vision-language model, perceives social attributes of faces.
The researchers compared the similarity between CLIP embeddings of text prompts built from social psychology terms and embeddings of synthetic face images.
They found that age, gender, and race systematically affect the model's social perception of faces, which suggests bias with respect to legally protected attributes.

Plain English Explanation

The researchers wanted to understand how an AI model that links images and language judges people from their faces. They took a pre-trained model that scores how well an image matches a piece of text, and used it to study how its social judgments shift with attributes like gender, race, and age.

The key finding was that the model's social judgments shifted systematically with these characteristics. The most striking pattern concerned the faces of Black women, for which the model produced extreme values of social perception across different ages and facial expressions. This matters because such biases can be encoded in a model and then perpetuated when the model is used in real-world applications.

Technical Explanation

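The researchers used a pre-trained CLIP (Contrastive Language-Image Pre-training) model, a vision-language model that maps images and text into a shared embedding space. Rather than generating descriptions, they measured the similarity between the embedding of each face image and the embeddings of text prompts constructed from validated social psychology terms. The face images were synthetic and were varied systematically and independently along six dimensions: age, gender, race, facial expression, lighting, and pose.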
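The similarity comparison described in the paper's abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the measure and uses invented three-dimensional vectors in place of real CLIP embeddings; the prompt wording is also hypothetical and is not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented stand-in for the embedding of one face image (assumption:
# real CLIP embeddings have hundreds of dimensions).
image_embedding = [0.9, 0.1, 0.2]

# Invented stand-ins for the embeddings of two text prompts.
# The prompt wording is hypothetical, not the paper's.
prompt_embeddings = {
    "a photo of a trustworthy person": [0.8, 0.2, 0.1],
    "a photo of an untrustworthy person": [0.1, 0.9, 0.3],
}

# Score the image against every prompt and pick the closest one.
scores = {
    prompt: cosine_similarity(image_embedding, embedding)
    for prompt, embedding in prompt_embeddings.items()
}
best_prompt = max(scores, key=scores.get)

for prompt, score in scores.items():
    print(f"{score:.3f}  {prompt}")
```

With a real model, the vectors would come from the image and text encoders; the scoring step would stay the same under the cosine-similarity assumption.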

Specifically, they looked at how age, gender, and race changed the model's social perception of a face. They found that all three attributes had systematic effects, with a particularly strong pattern of bias for the faces of Black women. They also found that facial expression affected social perception more than age did, and that lighting affected it about as much as age, so studies that do not control for such unprotected visual attributes may reach the wrong conclusions about bias.
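The paper's abstract describes face images varied independently along several dimensions, which makes it possible to isolate the effect of each attribute. The sketch below shows one way such a comparison could be set up. The attribute levels, the `toy_score` function, and all numeric values are invented for illustration, and only three of the six dimensions are included.

```python
from itertools import product
from statistics import mean

# Hypothetical attribute levels (the paper varies six dimensions;
# three are shown here to keep the grid small).
levels = {
    "age": ["young", "old"],
    "expression": ["neutral", "smiling"],
    "lighting": ["front", "side"],
}


def toy_score(face):
    """Stand-in for an image-text similarity score; values are invented."""
    score = 0.50
    score += 0.10 if face["expression"] == "smiling" else 0.0
    score += 0.04 if face["age"] == "old" else 0.0
    score += 0.04 if face["lighting"] == "side" else 0.0
    return score


# Full factorial grid: every combination of levels appears exactly once,
# so no attribute is correlated with any other.
faces = [dict(zip(levels, combo)) for combo in product(*levels.values())]


def attribute_effect(attribute):
    """Spread of the mean score across one attribute's levels,
    averaging over all other attributes."""
    level_means = [
        mean(toy_score(face) for face in faces if face[attribute] == level)
        for level in levels[attribute]
    ]
    return max(level_means) - min(level_means)


effects = {name: attribute_effect(name) for name in levels}
for name, effect in effects.items():
    print(f"{name}: {effect:.3f}")
```

Because the grid is balanced, averaging over the other attributes removes their influence from each comparison. In observational data, where attributes co-occur unevenly, the same calculation could mix the effects together.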

The researchers present their method, which combines social psychology terms with the controlled manipulation of individual face attributes, as applicable to studying bias in any vision-language model. This is an important area of research, as these types of biases in AI models can have real-world impacts when used in applications like facial analysis or image captioning.

Critical Analysis

The paper provides a thorough investigation of the social biases present in a state-of-the-art vision-language model. The researchers used a well-established model (CLIP) and a rigorous methodology to uncover the model's biased representations of facial features and social attributes.

One limitation of the study is that it focused on a single pre-trained model, CLIP. It would be valuable to extend the analysis to other vision-language models to see if the biases are consistent across different architectures and training approaches.

Additionally, the paper does not delve deeply into the potential real-world impacts of these biases. While the researchers mention the importance of addressing these issues, a more detailed discussion of the societal implications and ethical considerations would strengthen the critical analysis.

Overall, this is an important study that contributes to our understanding of the social biases present in advanced AI models. The findings highlight the need for continued research and development to create more fair and equitable AI systems.

Conclusion

This paper provides valuable insights into the social biases present in a state-of-the-art vision-language model. The researchers found that the model exhibited systematic biases in how it represented facial features in relation to social attributes like gender, race, and age. These biases reflect broader societal stereotypes and can have significant implications when such models are deployed in real-world applications.

The study underscores the importance of proactively addressing these issues in AI development, through techniques like bias mitigation and more inclusive data and model training. As AI systems become increasingly integrated into our lives, it is crucial that we develop them in a way that promotes fairness and equity, rather than perpetuating harmful biases.

Related Papers

Social perception of faces in a vision-language model

Carina I. Hausladen, Manuel Knott, Colin F. Camerer, Pietro Perona

We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP's social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age and lighting as much as age. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.

8/27/2024

Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe

Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.

8/29/2024

More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models

Messi H. J. Lee, Jacob M. Montgomery, Calvin K. Lai

Vision Language Models (VLMs), exemplified by GPT-4V, adeptly integrate text and vision modalities. This integration enhances Large Language Models' ability to mimic human perception, allowing them to process image inputs. Despite VLMs' advanced capabilities, however, there is a concern that VLMs inherit biases of both modalities in ways that make biases more pervasive and difficult to mitigate. Our study explores how VLMs perpetuate homogeneity bias and trait associations with regards to race and gender. When prompted to write stories based on images of human faces, GPT-4V describes subordinate racial and gender groups with greater homogeneity than dominant groups and relies on distinct, yet generally positive, stereotypes. Importantly, VLM stereotyping is driven by visual cues rather than group membership alone such that faces that are rated as more prototypically Black and feminine are subject to greater stereotyping. These findings suggest that VLMs may associate subtle visual cues related to racial and gender groups with stereotypes in ways that could be challenging to mitigate. We explore the underlying reasons behind this behavior and discuss its implications and emphasize the importance of addressing these biases as VLMs come to mirror human perception.

7/10/2024

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Haocheng Dai, Sarang Joshi

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.

5/24/2024