Distribution-aware Fairness Test Generation

2305.13935

Published 5/15/2024 by Sai Sathiesh Rajan, Ezekiel Soremekun, Yves Le Traon, Sudipta Chattopadhyay

🛸

Abstract

Ensuring that all classes of objects are detected with equal accuracy is essential in AI systems. For instance, being unable to identify any one class of objects could have fatal consequences in autonomous driving systems. Hence, ensuring the reliability of image recognition systems is crucial. This work addresses how to validate group fairness in image recognition software. We propose a distribution-aware fairness testing approach (called DistroFair) that systematically exposes class-level fairness violations in image classifiers via a synergistic combination of out-of-distribution (OOD) testing and semantic-preserving image mutation. DistroFair automatically learns the distribution (e.g., number/orientation) of objects in a set of images. Then it systematically mutates objects in the images to become OOD using three semantic-preserving image mutations - object deletion, object insertion and object rotation. We evaluate DistroFair using two well-known datasets (CityScapes and MS-COCO) and three major, commercial image recognition software (namely, Amazon Rekognition, Google Cloud Vision and Azure Computer Vision). Results show that about 21% of images generated by DistroFair reveal class-level fairness violations using either ground truth or metamorphic oracles. DistroFair is up to 2.3x more effective than two main baselines, i.e., (a) an approach which focuses on generating images only within the distribution (ID) and (b) fairness analysis using only the original image dataset. We further observed that DistroFair is efficient, it generates 460 images per hour, on average. Finally, we evaluate the semantic validity of our approach via a user study with 81 participants, using 30 real images and 30 corresponding mutated images generated by DistroFair. We found that images generated by DistroFair are 80% as realistic as real-world images.

Create account to get full access

Overview

This work addresses the challenge of ensuring the reliability and fairness of image recognition systems, which is crucial for applications like autonomous driving.
The authors propose a technique called DistroFair that systematically identifies class-level fairness violations in image classifiers.
DistroFair combines out-of-distribution (OOD) testing and semantic-preserving image mutations to expose biases in how well different object classes are detected.
The method is evaluated on two popular datasets and three major commercial image recognition APIs, revealing fairness issues in about 21% of the generated test cases.

Plain English Explanation

Image recognition systems, such as those used in autonomous vehicles, need to be able to accurately identify all types of objects, regardless of their characteristics. If a system consistently fails to detect certain classes of objects, it could have serious consequences.

The DistroFair approach tackles this problem by systematically testing image classifiers to uncover biases in how well they handle different object classes. It does this by taking a set of images, analyzing the distribution of objects within them, and then making strategic modifications to those objects. For example, the system might delete, insert, or rotate objects in the images in a way that preserves the overall meaning.

By subjecting the image classifier to these modified test cases, DistroFair can identify situations where the system performs poorly on certain object classes compared to others. This provides valuable insights into the fairness and reliability of the image recognition software.

The authors found that DistroFair was more effective at uncovering fairness issues than other baseline methods. It was also able to generate these test cases efficiently, creating hundreds of images per hour. And importantly, the user study showed that the modified images generated by DistroFair were still perceived as realistic by human observers.

Technical Explanation

The key innovation of DistroFair is its synergistic combination of out-of-distribution (OOD) testing and semantic-preserving image mutations to systematically expose class-level fairness violations in image classifiers.

First, DistroFair analyzes the distribution of objects (e.g., their number, orientation) in a set of training images. It then uses three types of semantic-preserving mutations to generate OOD test cases: object deletion, object insertion, and object rotation. These mutations alter the images in a way that preserves the overall meaning and context, but pushes the data outside of the original training distribution.

By subjecting the image classifier to these mutated test cases, DistroFair can identify situations where the system performs poorly on certain object classes compared to others. This reveals potential fairness issues in the underlying model.

The authors evaluated DistroFair using two popular datasets (CityScapes and MS-COCO) and three major commercial image recognition APIs (Amazon Rekognition, Google Cloud Vision, and Azure Computer Vision). Their results show that about 21% of the images generated by DistroFair exposed class-level fairness violations, outperforming two baseline approaches by up to 2.3x.

Additionally, the authors conducted a user study to assess the semantic validity of the generated test cases. They found that the mutated images were perceived as 80% as realistic as the original real-world images, indicating that DistroFair can create highly plausible test cases for evaluating image classifiers.

Critical Analysis

The DistroFair approach provides a valuable contribution to the field of algorithmic fairness and the development of reliable image recognition systems. By focusing on class-level fairness, the method addresses an important aspect of model performance that is often overlooked.

However, the paper does acknowledge some limitations. First, the mutations used by DistroFair, while semantically preserving, may not capture all possible real-world distributional shifts that an image classifier might encounter. Additional research is needed to benchmark the fairness of image upsampling methods and develop more comprehensive out-of-distribution detection techniques.

Furthermore, the user study evaluating the realism of the mutated images was relatively small in scale. Expanding this evaluation with a larger and more diverse set of participants could provide further insights into the validity of the DistroFair approach.

It would also be interesting to see how DistroFair performs on more specialized image recognition tasks, such as equitable deep learning for eye disease screening, where fairness is of critical importance.

Conclusion

The DistroFair technique represents an important step forward in ensuring the reliability and fairness of image recognition systems. By systematically exposing class-level biases through a combination of OOD testing and semantic-preserving mutations, the method provides a valuable tool for developers and researchers to assess the performance of their models across diverse object classes.

As image recognition becomes increasingly prevalent in safety-critical applications, the need for robust and equitable systems is paramount. The insights provided by DistroFair can help drive the development of more reliable and fair AI models, ultimately benefiting both users and society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🏷️

Increasing Fairness in Classification of Out of Distribution Data for Facial Recognition

Gianluca Barone, Aashrit Cunchala, Rudy Nunez

Standard classification theory assumes that the distribution of images in the test and training sets are identical. Unfortunately, real-life scenarios typically feature unseen data (out-of-distribution data) which is different from data in the training distribution(in-distribution). This issue is most prevalent in social justice problems where data from under-represented groups may appear in the test data without representing an equal proportion of the training data. This may result in a model returning confidently wrong decisions and predictions. We are interested in the following question: Can the performance of a neural network improve on facial images of out-of-distribution data when it is trained simultaneously on multiple datasets of in-distribution data? We approach this problem by incorporating the Outlier Exposure model and investigate how the model's performance changes when other datasets of facial images were implemented. We observe that the accuracy and other metrics of the model can be increased by applying Outlier Exposure, incorporating a trainable weight parameter to increase the machine's emphasis on outlier images, and by re-weighting the importance of different class labels. We also experimented with whether sorting the images and determining outliers via image features would have more of an effect on the metrics than sorting by average pixel value. Our goal was to make models not only more accurate but also more fair by scanning a more expanded range of images. We also tested the datasets in reverse order to see whether a more fair dataset with balanced features has an effect on the model's accuracy.

6/26/2024

cs.CV cs.CY cs.LG

Toward Fairer Face Recognition Datasets

Alexandre Fournier-Mongieux, Michael Soumm, Adrian Popescu, Bertrand Luvison, Herv'e Le Borgne

Face recognition and verification are two computer vision tasks whose performance has progressed with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive character of face data and biases in real training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems persist. We promote fairness by introducing a demographic attributes balancing mechanism in generated training datasets. We experiment with an existing real dataset, three generated training datasets, and the balanced versions of a diffusion-based dataset. We propose a comprehensive evaluation that considers accuracy and fairness equally and includes a rigorous regression-based statistical analysis of attributes. The analysis shows that balancing reduces demographic unfairness. Also, a performance gap persists despite generation becoming more accurate with time. The proposed balancing method and comprehensive verification evaluation promote fairer and transparent face recognition and verification.

6/26/2024

cs.CV

Supervised Algorithmic Fairness in Distribution Shifts: A Survey

Minglai Shao, Dong Li, Chen Zhao, Xintao Wu, Yujie Lin, Qin Tian

Supervised fairness-aware machine learning under distribution shifts is an emerging field that addresses the challenge of maintaining equitable and unbiased predictions when faced with changes in data distributions from source to target domains. In real-world applications, machine learning models are often trained on a specific dataset but deployed in environments where the data distribution may shift over time due to various factors. This shift can lead to unfair predictions, disproportionately affecting certain groups characterized by sensitive attributes, such as race and gender. In this survey, we provide a summary of various types of distribution shifts and comprehensively investigate existing methods based on these shifts, highlighting six commonly used approaches in the literature. Additionally, this survey lists publicly available datasets and evaluation metrics for empirical studies. We further explore the interconnection with related research fields, discuss the significant challenges, and identify potential directions for future studies.

5/7/2024

cs.LG cs.AI cs.CY

Benchmarking the Fairness of Image Upsampling Methods

Mike Laszkiewicz, Imant Daunhawer, Julia E. Vogt, Asja Fischer, Johannes Lederer

Recent years have witnessed a rapid development of deep generative models for creating synthetic media, such as images and videos. While the practical applications of these models in everyday tasks are enticing, it is crucial to assess the inherent risks regarding their fairness. In this work, we introduce a comprehensive framework for benchmarking the performance and fairness of conditional generative models. We develop a set of metrics$unicode{x2013}$inspired by their supervised fairness counterparts$unicode{x2013}$to evaluate the models on their fairness and diversity. Focusing on the specific application of image upsampling, we create a benchmark covering a wide variety of modern upsampling methods. As part of the benchmark, we introduce UnfairFace, a subset of FairFace that replicates the racial distribution of common large-scale face datasets. Our empirical study highlights the importance of using an unbiased training set and reveals variations in how the algorithms respond to dataset imbalances. Alarmingly, we find that none of the considered methods produces statistically fair and diverse results. All experiments can be reproduced using our provided repository.

5/1/2024

cs.CV cs.AI cs.LG