Fair Wasserstein Coresets

Read original: arXiv:2311.05436 - Published 6/5/2024 by Zikai Xiong, Niccolò Dalmasso, Shubham Sharma, Freddy Lecue, Daniele Magazzeni, Vamsi K. Potluru, Tucker Balch, Manuela Veloso

Overview

Data distillation and coresets are popular approaches to generate a smaller, representative set of samples for large-scale learning tasks
Machine learning is being increasingly applied to societal decision-making, making it crucial to address inherent biases in the data
Current approaches focus on creating fair synthetic samples, but their impact on downstream learning has not been fully explored

Plain English Explanation

When working with large datasets, researchers often use techniques like data distillation and coresets to create a smaller, representative subset of the data. This can make it easier and faster to use the data for machine learning tasks.

At the same time, machine learning is being used more and more to make decisions that affect society. This means it's important to make sure the data used doesn't have unfair biases against certain groups. Some researchers have tried to create fair synthetic data samples, but the impact of this on the final machine learning models hasn't been well-studied yet.

Technical Explanation

The paper introduces a new approach called "Fair Wasserstein Coresets" (FWC) that generates a fair set of synthetic samples, together with sample-level weights, for use in downstream learning tasks. FWC uses an efficient majority minimization algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity, a fairness criterion that asks for outcome rates to be equal across the groups in the data.
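To make the demographic parity constraint concrete, here is a minimal sketch of how one might measure it on a weighted set of samples. It is not the paper's code: the function name `demographic_parity_gap` and the toy data are assumptions made for illustration, and outcomes are assumed to be binary (0 or 1).

```python
from collections import defaultdict


def demographic_parity_gap(labels, groups, weights=None):
    """Return (gap, rates): the largest difference in weighted positive-outcome
    rate between any two groups, and the per-group rates themselves."""
    if weights is None:
        weights = [1.0] * len(labels)  # unweighted data: every sample counts equally
    positive = defaultdict(float)  # weighted count of positive outcomes per group
    total = defaultdict(float)     # total weight per group
    for y, g, w in zip(labels, groups, weights):
        total[g] += w
        positive[g] += w * y
    rates = {g: positive[g] / total[g] for g in total}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical toy coreset: four synthetic samples with binary outcomes,
# a binary sensitive attribute, and sample-level weights (all made up).
labels = [1, 0, 1, 0]
groups = ["a", "a", "b", "b"]
weights = [0.3, 0.2, 0.1, 0.4]

gap, rates = demographic_parity_gap(labels, groups, weights)
print(rates, gap)
```

A gap of zero means every group receives positive outcomes at the same weighted rate. In this toy example group "a" has a rate of about 0.6 and group "b" about 0.2, so the weighted samples would not satisfy demographic parity.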

The paper shows that an unconstrained version of FWC, that is, one without the fairness constraint, is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments on both synthetic and real-world datasets demonstrate that FWC:

Achieves a good balance between fairness and performance in downstream models, compared to existing approaches
Improves fairness when added to the existing training data
Can be used to reduce biases in predictions from large language models like GPT-3.5 and GPT-4
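The link to Lloyd's algorithm mentioned above can be illustrated with plain k-means, where the centroids play the role of synthetic samples and the share of points in each cluster plays the role of sample-level weights. The sketch below shows only standard Lloyd's algorithm for k-means with no fairness constraint; it is not the FWC algorithm. The names `nearest` and `lloyd_coreset`, the toy points, and the choice to initialize from the first k points are assumptions made for illustration.

```python
def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
    )


def lloyd_coreset(points, k, iters=100):
    """Run Lloyd's k-means and return (centroids, weights) as a weighted summary."""
    # Simplification: start from the first k points instead of a random draw.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        assign = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged: the centroids no longer move
        centroids = updated
    # Weights are the fraction of original points assigned to each centroid.
    assign = [nearest(p, centroids) for p in points]
    weights = [assign.count(j) / len(points) for j in range(k)]
    return centroids, weights


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, weights = lloyd_coreset(points, k=2)
print(centroids, weights)
```

On this toy data the two centroids settle at the means of the two groups and each receives half of the total weight, so six original points are summarized by two weighted synthetic ones.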

Critical Analysis

The paper presents a novel and promising approach to generating fair, representative samples from large datasets. However, the authors acknowledge that their method relies on certain assumptions, such as knowing the sensitive attributes in the data a priori. In real-world scenarios, this information may not always be available or easily identifiable.

Additionally, the paper does not explore the scalability of the FWC algorithm for truly massive datasets. The performance and fairness trade-offs may need to be further investigated as the dataset size grows. The authors also mention that their method may not be suitable for all types of downstream tasks, and more research is needed to understand its broader applicability.

Conclusion

The Fair Wasserstein Coresets (FWC) approach introduced in this paper offers a way to generate fair, representative samples from large datasets for use in machine learning tasks. By optimizing for both accuracy and demographic parity, FWC has the potential to help address the growing concern of bias in AI systems as they are increasingly applied to high-stakes societal decisions. While the method has some limitations, it represents an important step forward in the ongoing effort to develop more equitable and inclusive machine learning techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Fair Wasserstein Coresets

Zikai Xiong, Niccolò Dalmasso, Shubham Sharma, Freddy Lecue, Daniele Magazzeni, Vamsi K. Potluru, Tucker Balch, Manuela Veloso

Data distillation and coresets have emerged as popular approaches to generate a smaller representative set of samples for downstream learning tasks to handle large-scale datasets. At the same time, machine learning is being increasingly applied to decision-making processes at a societal level, making it imperative for modelers to address inherent biases towards subgroups present in the data. While current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples, their impact on downstream learning processes has yet to be explored. In this work, we present fair Wasserstein coresets (FWC), a novel coreset approach which generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC uses an efficient majority minimization algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity. We show that an unconstrained version of FWC is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments conducted on both synthetic and real datasets show that FWC: (i) achieves a competitive fairness-utility tradeoff in downstream models compared to existing approaches, (ii) improves downstream fairness when added to the existing training data and (iii) can be used to reduce biases in predictions from large language models (GPT-3.5 and GPT-4).

6/5/2024

The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection

Mohammad Jafari, Yimeng Zhang, Yihua Zhang, Sijia Liu

As machine learning tasks continue to evolve, the trend has been to gather larger datasets and train increasingly larger models. While this has led to advancements in accuracy, it has also escalated computational costs to unsustainable levels. Addressing this, our work aims to strike a delicate balance between computational efficiency and model accuracy, a persisting challenge in the field. We introduce a novel method that employs core subset selection for reweighting, effectively optimizing both computational time and model performance. By focusing on a strategically selected coreset, our approach offers a robust representation, as it efficiently minimizes the influence of outliers. The re-calibrated weights are then mapped back to and propagated across the entire dataset. Our experimental results substantiate the effectiveness of this approach, underscoring its potential as a scalable and precise solution for model training.

6/3/2024

No Dimensional Sampling Coresets for Classification

Meysam Alishahi, Jeff M. Phillips

We refine and generalize what is known about coresets for classification problems via the sensitivity sampling framework. Such coresets seek the smallest possible subsets of input data, so one can optimize a loss function on the coreset and ensure approximation guarantees with respect to the original data. Our analysis provides the first no dimensional coresets, so the size does not depend on the dimension. Moreover, our results are general, apply for distributional input and can use iid samples, so provide sample complexity bounds, and work for a variety of loss functions. A key tool we develop is a Rademacher complexity version of the main sensitivity sampling approach, which can be of independent interest.

7/24/2024

Bayesian Pseudo-Coresets via Contrastive Divergence

Piyush Tiwary, Kumar Shubham, Vivek V. Kashyap, Prathosh A. P

Bayesian methods provide an elegant framework for estimating parameter posteriors and quantification of uncertainty associated with probabilistic models. However, they often suffer from slow inference times. To address this challenge, Bayesian Pseudo-Coresets (BPC) have emerged as a promising solution. BPC methods aim to create a small synthetic dataset, known as pseudo-coresets, that approximates the posterior inference achieved with the original dataset. This approximation is achieved by optimizing a divergence measure between the true posterior and the pseudo-coreset posterior. Various divergence measures have been proposed for constructing pseudo-coresets, with forward Kullback-Leibler (KL) divergence being the most successful. However, using forward KL divergence necessitates sampling from the pseudo-coreset posterior, often accomplished through approximate Gaussian variational distributions. Alternatively, one could employ Markov Chain Monte Carlo (MCMC) methods for sampling, but this becomes challenging in high-dimensional parameter spaces due to slow mixing. In this study, we introduce a novel approach for constructing pseudo-coresets by utilizing contrastive divergence. Importantly, optimizing contrastive divergence eliminates the need for approximations in the pseudo-coreset construction process. Furthermore, it enables the use of finite-step MCMC methods, alleviating the requirement for extensive mixing to reach a stationary distribution. To validate our method's effectiveness, we conduct extensive experiments on multiple datasets, demonstrating its superiority over existing BPC techniques.

5/10/2024