The Bias of Harmful Label Associations in Vision-Language Models

arXiv:2402.07329

Published 4/17/2024 by Caner Hazirbas, Alicia Sun, Yonathan Efroni, Mark Ibrahim

Abstract

Despite the remarkable performance of foundation vision-language models, the shared representation space for text and vision can also encode harmful label associations detrimental to fairness. While prior work has uncovered bias in vision-language models' (VLMs) classification performance across geography, work has been limited along the important axis of harmful label associations due to a lack of rich, labeled data. In this work, we investigate harmful label associations in the recently released Casual Conversations datasets containing more than 70,000 videos. We study bias in the frequency of harmful label associations across self-provided labels for age, gender, apparent skin tone, and physical adornments across several leading VLMs. We find that VLMs are 4-7x more likely to harmfully classify individuals with darker skin tones. We also find scaling transformer encoder model size leads to higher confidence in harmful predictions. Finally, we find improvements on standard vision tasks across VLMs do not address disparities in harmful label associations.

Overview

Recent foundation vision-language models (VLMs) have shown remarkable performance, but their shared representation space can also encode harmful label associations.
Prior work has uncovered biases in VLMs' classification performance, but there has been limited investigation into harmful label associations due to a lack of rich, labeled data.
This paper investigates harmful label associations in the Casual Conversations datasets, which contain over 70,000 videos with self-provided labels for age, gender, apparent skin tone, and physical adornments.

Plain English Explanation

Advanced AI models that can process both text and images have achieved impressive results. However, the way these models represent and combine visual and textual information can also lead to harmful biases. For example, the models may be more likely to associate certain attributes like skin tone with negative or stereotypical labels.

This research looked at a large collection of over 70,000 videos where people self-reported details like their age, gender, and appearance. The researchers studied how often leading vision-language models attached harmful labels to the people in these videos, and whether that frequency differed across groups defined by age, gender, apparent skin tone, and physical adornments.

The key findings are that the models were 4-7 times more likely to harmfully classify individuals with darker skin tones, and that scaling up the transformer encoder increased the models' confidence in these harmful predictions. The researchers also found that improvements on standard vision tasks did not address these disparities in harmful label associations.

Technical Explanation

This paper investigates the prevalence of harmful label associations in leading vision-language models (VLMs). The researchers used the recently released Casual Conversations datasets, which contain over 70,000 videos with self-provided labels for attributes like age, gender, apparent skin tone, and physical adornments.
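The summary does not spell out how a label is assigned to a video frame, so the following is only a minimal sketch of how a harmful label association can surface in a shared representation space, assuming CLIP-style zero-shot classification: an image embedding is compared with the text embeddings of candidate labels by cosine similarity, and a softmax turns the similarities into confidences. The function name `zero_shot_predict`, the placeholder label names, the temperature value, and the random toy embeddings are all assumptions for illustration, not the authors' code, labels, or data.

```python
import numpy as np

def zero_shot_predict(image_emb, label_embs, labels, temperature=0.01):
    """Pick the label whose text embedding is closest to the image embedding.

    Returns (predicted_label, confidence), where confidence is the softmax
    probability of the predicted label over all candidate labels.
    """
    # L2-normalise so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature  # temperature value is an assumption
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    best = int(np.argmax(probs))
    return labels[best], float(probs[best])

# Placeholder label set (hypothetical names, not the paper's label list).
LABELS = ["label_a", "label_b", "harmful_label"]
HARMFUL = {"harmful_label"}

# Toy embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
label_embs = rng.normal(size=(3, 8))
# Toy image embedding placed close to the third label's text embedding.
image_emb = label_embs[2] + 0.1 * rng.normal(size=8)

pred, conf = zero_shot_predict(image_emb, label_embs, LABELS)
is_harmful = pred in HARMFUL
print(pred, round(conf, 3), is_harmful)
```

Under this framing, "higher confidence in harmful predictions" would correspond to a larger softmax probability on a label in the harmful set, which is why the sketch returns the confidence alongside the predicted label.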

The team measured how often several leading VLMs produced harmful label associations for each of these attribute groups. They found that the models were 4-7 times more likely to harmfully classify individuals with darker skin tones. They also found that scaling the transformer encoder model size led to higher confidence in harmful predictions, and that improvements on standard vision tasks did not reduce the disparities in harmful label associations.
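A statement such as "4-7 times more likely" can be read as a ratio of per-group harmful-prediction rates. The sketch below shows that arithmetic under stated assumptions: each record is a (group, is_harmful) pair, and the function names `harmful_rate_by_group` and `disparity_ratio`, the group names, and the counts are invented for illustration. The toy numbers are not results from the paper, and the authors' exact metric may differ.

```python
from collections import defaultdict

def harmful_rate_by_group(records):
    """Fraction of predictions flagged harmful within each group.

    `records` is an iterable of (group, is_harmful) pairs.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for group, is_harmful in records:
        totals[group] += 1
        harmful[group] += int(is_harmful)
    return {group: harmful[group] / totals[group] for group in totals}

def disparity_ratio(rates, group, reference):
    """How many times more often `group` is harmfully classified than `reference`."""
    if rates[reference] == 0:
        raise ValueError("reference group has no harmful predictions; ratio is undefined")
    return rates[group] / rates[reference]

# Made-up toy counts: 8 of 100 harmful predictions for one group, 2 of 100 for the other.
records = (
    [("darker", True)] * 8 + [("darker", False)] * 92
    + [("lighter", True)] * 2 + [("lighter", False)] * 98
)

rates = harmful_rate_by_group(records)
ratio = disparity_ratio(rates, "darker", "lighter")
print(rates, ratio)
```

With these toy counts the rates are 0.08 and 0.02, so the ratio comes out at about 4, the lower end of the range reported in the paper.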

These findings suggest that improving the overall accuracy of VLMs does not necessarily address the issue of harmful label associations encoded in the shared representation space. The researchers highlight the need for more rigorous evaluation of bias and fairness in these powerful multimodal models.

Critical Analysis

The research presented in this paper provides valuable insights into the problem of harmful label associations in vision-language models. The use of the Casual Conversations datasets, with their rich, self-provided attribute labels, allows for a more thorough investigation of these biases than was possible in prior work.

However, the paper does not delve into the potential causes or mechanisms underlying the observed disparities. Further research could explore the specific architectural choices, training data, or other factors that may contribute to these harmful associations.

Additionally, the paper focuses on a limited set of attributes (age, gender, skin tone, physical adornments) and does not address potential intersectional biases or other forms of harmful bias that may be present in these models. Expanding the scope of the analysis could provide a more comprehensive understanding of the problem.

Overall, this research highlights the importance of thorough bias and fairness evaluation in the development of powerful multimodal AI systems. The findings underscore the need for more robust techniques to mitigate harmful label associations and ensure the equitable application of these technologies.

Conclusion

This paper investigates harmful label associations in leading vision-language models. The researchers used the Casual Conversations datasets to study how often these models harmfully classify individuals across self-provided labels for age, gender, apparent skin tone, and physical adornments.

The key finding is that the models are 4-7 times more likely to harmfully classify individuals with darker skin tones, and that scaling the transformer encoder model size increases confidence in harmful predictions. The researchers also show that improvements on standard vision tasks do not address these disparities in harmful label associations.

This work underscores the critical need for thorough bias and fairness evaluation in the development of advanced multimodal AI systems. As these technologies become increasingly powerful and widespread, it is essential to ensure they are applied equitably and without perpetuating harmful stereotypes or biases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals

Phillip Howard, Kathleen C. Fraser, Anahita Bhiwandiwalla, Svetlana Kiritchenko

With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different models under this counterfactual generation setting at scale, producing over 57 million responses from popular LVLMs. Our multi-dimensional analysis reveals that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence the generation of toxic content, competency-associated words, harmful stereotypes, and numerical ratings of depicted individuals. We additionally explore the relationship between social bias in LVLMs and their corresponding LLMs, as well as inference-time strategies to mitigate bias.

5/31/2024

cs.CV

A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models

Ashutosh Sathe, Prachi Jain, Sunayana Sitaram

Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs). In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. Additionally, we find that VLM models exhibit distinct biases across different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.

6/18/2024

cs.CV cs.CL cs.CY

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Haocheng Dai, Sarang Joshi

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here.

5/24/2024

cs.CV cs.CL

Uncovering Bias in Large Vision-Language Models with Counterfactuals

Phillip Howard, Anahita Bhiwandiwalla, Kathleen C. Fraser, Svetlana Kiritchenko

With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different LVLMs under this counterfactual generation setting and find that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence toxicity and the generation of competency-associated words.

6/11/2024

cs.CV cs.AI