FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Read original: arXiv:2404.16123 - Published 4/26/2024 by Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle

Overview

Recent research has shown that pruning duplicate and near-duplicate samples from large image-caption datasets can significantly reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance loss compared to training on the full dataset.
However, these datasets are known to contain harmful social biases, which can then get encoded into the trained models.
This paper evaluates how deduplication affects the prevalence of these biases and introduces a new algorithm called FairDeDup to mitigate the negative effects.

Plain English Explanation

Training large VLP models like CLIP is very computationally expensive. Recent research has shown that by removing duplicate data from the training datasets, you can dramatically reduce the cost of training these models without significantly impacting their performance.

However, the datasets commonly used to train these models, typically image-caption pairs scraped from the web, often contain harmful social biases. These biases can then be reflected in the trained models.

This paper looks at how the deduplication process affects the prevalence of these biases. It also introduces a new algorithm called FairDeDup that can reduce the negative impacts of deduplication on model fairness, while still maintaining the performance benefits.

Technical Explanation

The researchers evaluated how dataset deduplication affects the social biases present in CLIP-style models trained on the large LAION-400M dataset. They found that the recently proposed SemDeDup deduplication algorithm can lead to increased bias in the resulting models.
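To make the idea of semantic deduplication concrete, here is a minimal sketch, assuming each image-caption pair has already been embedded as a vector. It is not the SemDeDup implementation, whose details this summary does not give: the function name `semantic_dedup`, the threshold `eps`, and the greedy keep-the-first rule are all illustrative assumptions.

```python
import numpy as np

def semantic_dedup(embeddings, eps=0.1):
    """Greedy near-duplicate removal (illustrative, not SemDeDup itself).

    A sample is kept only if its cosine similarity to every sample kept
    so far is at most 1 - eps. Returns the indices of the kept samples.
    """
    x = np.asarray(embeddings, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    kept = []
    for i in range(len(x)):
        if not kept or np.max(x[kept] @ x[i]) <= 1.0 - eps:
            kept.append(i)
    return kept

# Toy data: samples 0 and 1 point in almost the same direction.
toy = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(semantic_dedup(toy, eps=0.1))  # [0, 2]
```

Note that the rule for which member of a duplicate group survives (here, simply the first one seen) is arbitrary. That choice is exactly where a deduplication step can shift which kinds of samples remain in the dataset.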

To address this, they introduced FairDeDup, a modification to SemDeDup that aims to preserve fairness during the deduplication process. When training CLIP models on deduplicated versions of LAION-400M, they found that FairDeDup consistently led to improved fairness metrics on benchmark datasets like FairFace and FACET, compared to SemDeDup, while maintaining zero-shot performance on standard CLIP evaluation tasks.
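As a rough illustration of what "preserving fairness during deduplication" could mean, the sketch below picks one representative per duplicate group while balancing a sensitive attribute across the kept samples. This is an assumption-laden toy, not the FairDeDup algorithm: the summary does not describe the actual selection rule, and the per-sample `attribute` labels and the function `pick_representatives` are hypothetical (web-scraped data does not come with such labels).

```python
from collections import Counter

def pick_representatives(duplicate_groups, attribute):
    """Keep one sample per duplicate group (illustrative only).

    Within each group, prefer the sample whose attribute value has been
    kept least often so far. `attribute` maps a sample index to a
    hypothetical sensitive-attribute label.
    """
    kept = []
    counts = Counter()
    for group in duplicate_groups:
        # Ties are broken by the lowest sample index.
        choice = min(group, key=lambda i: (counts[attribute[i]], i))
        kept.append(choice)
        counts[attribute[choice]] += 1
    return kept

# Three duplicate groups, each containing one "a" and one "b" sample.
groups = [[0, 1], [2, 3], [4, 5]]
labels = ["a", "b", "a", "b", "a", "b"]
print(pick_representatives(groups, labels))  # [0, 3, 4]
```

Always keeping the first member of each group would retain samples 0, 2, and 4, all labeled "a". The balancing rule keeps the same number of samples but retains at least one "b", which is the kind of effect a fairness-aware selection rule aims for.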

Critical Analysis

The paper provides a valuable contribution by highlighting an important tradeoff between dataset deduplication and model fairness. The authors acknowledge that their proposed FairDeDup algorithm is a relatively simple modification, and there may be more sophisticated approaches to this problem.

Additionally, the evaluation is limited to a specific CLIP-style architecture and a few fairness benchmarks. It would be interesting to see how these findings generalize to other VLP model architectures and a wider range of fairness metrics and downstream tasks.

The paper also does not deeply explore the underlying reasons why deduplication can exacerbate biases. Further research into the mechanisms at play could lead to more principled solutions.

Conclusion

This paper demonstrates that while dataset deduplication can significantly reduce the computational cost of training large VLP models, it can also amplify harmful social biases present in the original data. The authors introduce an easy-to-implement modification of SemDeDup, called FairDeDup, that helps mitigate this issue: it improves fairness metrics relative to SemDeDup while maintaining zero-shot performance.

This work highlights the importance of considering fairness when developing techniques that make dataset curation and model training more efficient. As the field of AI continues to advance, maintaining a focus on responsible and ethical practices will be crucial.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle

Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.

4/26/2024

FairCLIP: Harnessing Fairness in Vision-Language Learning

Yan Luo, Min Shi, Muhammad Osama Khan, Muhammad Muneeb Afzal, Hao Huang, Shuaihang Yuan, Yu Tian, Luo Song, Ava Kouhana, Tobias Elze, Yi Fang, Mengyu Wang

Fairness is a critical concern in deep learning, especially in healthcare, where these models influence diagnoses and treatment decisions. Although fairness has been investigated in the vision-only domain, the fairness of medical vision-language (VL) models remains unexplored due to the scarcity of medical VL datasets for studying fairness. To bridge this research gap, we introduce the first fair vision-language medical dataset Harvard-FairVLMed that provides detailed demographic attributes, ground-truth labels, and clinical notes to facilitate an in-depth examination of fairness within VL foundation models. Using Harvard-FairVLMed, we conduct a comprehensive fairness analysis of two widely-used VL models (CLIP and BLIP2), pre-trained on both natural and medical domains, across four different protected attributes. Our results highlight significant biases in all VL models, with Asian, Male, Non-Hispanic, and Spanish being the preferred subgroups across the protected attributes of race, gender, ethnicity, and language, respectively. In order to alleviate these biases, we propose FairCLIP, an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions corresponding to each demographic group. As the first VL dataset of its kind, Harvard-FairVLMed holds the potential to catalyze advancements in the development of machine learning models that are both ethically aware and clinically effective. Our dataset and code are available at https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k.

4/9/2024

Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe

Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction of increasingly subtle social biases. Moreover, we use a hierarchical clustering approach to show that dataset size predicts the extent to which the underlying structure of facial impression bias resembles that of facial impression bias in humans. Finally, we show that Stable Diffusion models employing CLIP as a text encoder learn facial impression biases, and that these biases intersect with racial biases in Stable Diffusion XL-Turbo. While pretrained CLIP models may prove useful for scientific studies of bias, they will also require significant dataset curation when intended for use as general-purpose models in a zero-shot setting.

8/29/2024

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

Nan He, Weichen Xiong, Hanwen Liu, Yi Liao, Lei Ding, Kai Zhang, Guohua Tang, Xiao Han, Wei Yang

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable information and neglects the varying degrees of duplication. To address this, we propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness. Central to our approach is the concept of data commonness, a metric we introduce to quantify the degree of duplication by measuring the occurrence probabilities of samples using an n-gram model. Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps. Additionally, it enhances average few-shot downstream accuracy by 1.77% when trained for an equivalent duration. Importantly, this approach consistently improves performance, even on rigorously deduplicated datasets, indicating its potential to complement existing methods and become a standard pre-training process for LLMs.

7/10/2024