VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

2406.14194

Published 6/21/2024 by Jie Zhang, Sibo Wang, Xiangkui Cao, Zheng Yuan, Shiguang Shan, Xilin Chen, Wen Gao

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

Abstract

The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are tempered by the outputs that often reflect biases, a concern not yet extensively investigated. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a benchmark aimed at evaluating biases in LVLMs comprehensively. In VLBiasBench, we construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status). To create a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with different questions to form 128,342 samples. These questions are categorized into open and close ended types, fully considering the sources of bias and comprehensively evaluating the biases of LVLM from multiple perspectives. We subsequently conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.

Create account to get full access

Overview

This paper introduces VLBiasBench, a comprehensive benchmark for evaluating bias in large vision-language models.
VLBiasBench aims to provide a standardized way to assess biases in these powerful AI systems, which can have significant societal impacts.
The benchmark covers a wide range of bias types, including gender bias, racial bias, and other forms of social bias.

Plain English Explanation

VLBiasBench is a new tool that helps researchers and developers understand the biases present in large AI models that can see and understand language. These models, known as vision-language models, are incredibly powerful and can be used for all sorts of tasks like image captioning, visual question answering, and multi-modal reasoning.

However, like many AI systems, vision-language models can also pick up on and amplify societal biases related to gender, race, and other demographic factors. VLBiasBench aims to provide a standardized way to measure and assess these biases, so that model developers can identify and mitigate them.

The benchmark includes a wide range of test cases that probe for different types of bias, from stereotypical associations to counterfactual reasoning. By running vision-language models through this comprehensive suite of tests, researchers can get a clearer picture of the biases present and work to address them.

Technical Explanation

VLBiasBench is a novel benchmark designed to comprehensively evaluate biases in large vision-language models. These models are trained on vast amounts of data to perform tasks like image captioning, visual question answering, and multi-modal reasoning. However, they can also pick up on and amplify societal biases related to gender, race, and other demographic factors.

The benchmark includes a diverse set of test cases that probe for different types of bias, drawing on existing datasets and benchmarks like UVSB, DiffuSyn, and VLIND. It covers a range of bias categories, including stereotypical associations, intersectional biases, and counterfactual reasoning.

The benchmark is designed to provide a standardized way to assess and compare the biases present in different vision-language models. By running these models through the VLBiasBench test suite, researchers and developers can identify specific areas where biases exist and work to mitigate them through model fine-tuning, data augmentation, or other techniques.

Critical Analysis

The VLBiasBench paper provides a valuable contribution to the field by introducing a comprehensive benchmark for evaluating bias in vision-language models. The breadth of the test cases and the inclusion of established datasets like UVSB, DiffuSyn, and VLIND are particular strengths of the approach.

However, the paper acknowledges that the benchmark is not exhaustive and that there may be other forms of bias that are not captured by the current test cases. Additionally, the paper notes that the evaluation of certain types of bias, such as intersectional biases, can be challenging and may require further research and development.

It would also be interesting to see how the benchmark performs in assessing the biases of different types of vision-language models, including those trained using various data sources, model architectures, and fine-tuning approaches. This could provide insights into the factors that contribute to the development of biases in these systems.

Overall, VLBiasBench represents an important step forward in the effort to understand and mitigate biases in large, powerful AI models. As the field of AI continues to evolve, the development of robust, standardized evaluation tools like this will be crucial for ensuring that these systems are aligned with societal values and do not perpetuate harmful biases.

Conclusion

The VLBiasBench paper introduces a comprehensive benchmark for evaluating bias in large vision-language models. These powerful AI systems can have significant societal impacts, and it is crucial to understand and mitigate the biases they may pick up from the data they are trained on.

VLBiasBench provides a standardized way to assess a wide range of bias types, including gender, racial, and other forms of social bias. By running vision-language models through this benchmark, researchers and developers can identify specific areas where biases exist and work to address them.

While the benchmark is not exhaustive, it represents an important step forward in the effort to create more equitable and inclusive AI systems. As the field of AI continues to evolve, tools like VLBiasBench will be essential for ensuring that these powerful technologies are aligned with societal values and do not perpetuate harmful biases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

New!GenderBias-emph{VL}: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Yisong Xiao, Aishan Liu, QianJia Cheng, Zhenfei Yin, Siyuan Liang, Jiapeng Li, Jing Shao, Xianglong Liu, Dacheng Tao

Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-emph{VL} benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first utilize text-to-image diffusion models to generate occupation images and their gender counterfactuals. Subsequently, we generate corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals to expose biases in LVLMs, applicable in both multimodal and unimodal contexts through modifying gender attributes in specific modalities. Overall, our GenderBias-emph{VL} benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (eg, LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases presented by these models. footnote{The dataset and code are available at the href{https://genderbiasvl.github.io/}{website}.}

7/2/2024

cs.CV cs.AI

A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models

Ashutosh Sathe, Prachi Jain, Sunayana Sitaram

Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs). In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. Additionally, we find that VLM models exhibit distinct biases across different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.

6/18/2024

cs.CV cs.CL cs.CY

Evaluating Large Vision-Language Models' Understanding of Real-World Complexities Through Synthetic Benchmarks

Haokun Zhou, Yipeng Hong

This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and perform significantly worse compared to humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method through constructing two caparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.

6/14/2024

cs.CV cs.AI

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Kang-il Lee, Minbeom Kim, Minsung Kim, Dongryeol Lee, Hyukhun Koh, Kyomin Jung

Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.

6/19/2024

cs.AI cs.CL cs.CV