Fair Overlap Number of Balls (Fair-ONB): A Data-Morphology-based Undersampling Method for Bias Reduction

Read original: arXiv:2407.14210 - Published 7/22/2024 by José Daniel Pascual-Triana, Alberto Fernández, Paulo Novais, Francisco Herrera

Overview

The paper proposes a new data undersampling method called "Fair Overlap Number of Balls (Fair-ONB)" to reduce bias in machine learning models.
The method uses the data morphology of the different data groups, formed by combining classes and protected feature values, to guide undersampling in the areas where those groups overlap.
The reported results show reduced bias with low impact on the classifier's predictive performance.

Plain English Explanation

The paper introduces a new technique called "Fair Overlap Number of Balls (Fair-ONB)" to help reduce unfair biases in machine learning models. Machine learning models can sometimes learn and amplify biases present in the training data, leading to discriminatory outcomes.

The Fair-ONB method works by analyzing the "morphology", or shape, of the data. It looks for regions where different groups of data points overlap and removes some of the points in those regions. This helps balance the representation of different groups in the data, reducing the biases that the model might pick up.

By selectively removing certain data points in this way, the method can make the machine learning model fairer without significantly degrading its overall performance on the main task. This matters because many existing bias mitigation techniques come with tradeoffs in model accuracy or efficiency.

Technical Explanation

Fair-ONB is a data undersampling technique that aims to reduce unfair bias in machine learning models. It analyzes the "morphology", or geometric structure, of the training data to locate the areas where different data groups overlap, and performs guided undersampling there.

The key idea is that over-represented groups can "crowd out" under-represented ones, leading the model to learn biased representations. Groups are formed by combining classes with protected feature values. By undersampling selectively where groups overlap, which tends to be close to the decision boundary, Fair-ONB fosters more balanced learning of both classes and protected feature values.
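To see which groups are over- or under-represented before any undersampling, one can simply count the instances in each group. The sketch below assumes a group is the pair (class label, protected feature value); the function name `group_counts` and the toy data are illustrative and do not come from the paper.

```python
from collections import Counter

def group_counts(labels, protected):
    """Count instances per (class label, protected value) group.

    Hypothetical helper: pairs each instance's class label with its
    protected feature value and tallies how often each pair occurs.
    """
    return Counter(zip(labels, protected))

# Toy data (illustrative only): class labels and a binary protected feature.
labels = [1, 1, 1, 0, 0, 1, 0, 1]
protected = ["a", "a", "a", "a", "b", "b", "b", "a"]

counts = group_counts(labels, protected)
largest_group, largest_size = counts.most_common(1)[0]
print(counts)
print("largest group:", largest_group, "with", largest_size, "instances")
```

A large gap between the biggest and smallest counts is the kind of imbalance that group-aware undersampling tries to reduce.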

The method builds a ball coverage of each group and uses attributes of the balls, such as their radius, number of covered instances and density, to select the most suitable areas for undersampling.
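A ball coverage can be pictured with a small greedy sketch. It assumes each candidate ball is centred on an instance and grows until it would touch an instance of another group, and that balls are picked greedily by how many uncovered instances they contain. This is a simplification for illustration, not the paper's algorithm; `greedy_ball_cover` and the toy data are hypothetical.

```python
import numpy as np

def greedy_ball_cover(X, groups):
    """Greedily cover the data with single-group ("pure") balls.

    Assumption for this sketch: a ball's radius is the distance from its
    centre to the nearest instance of any other group, so it can only
    contain instances of the centre's own group.
    Returns a list of (centre_index, radius, covered_indices).
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    # Pairwise Euclidean distances between all instances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    other = groups[:, None] != groups[None, :]
    # Radius of each candidate ball: distance to the nearest other-group instance.
    radius = np.where(other, dist, np.inf).min(axis=1)
    # covers[i, j] is True when the ball centred on i strictly contains j.
    covers = dist < radius[:, None]
    np.fill_diagonal(covers, True)  # a centre always covers itself
    uncovered = np.ones(len(X), dtype=bool)
    balls = []
    while uncovered.any():
        gain = (covers & uncovered).sum(axis=1)
        best = int(gain.argmax())  # ball covering the most uncovered instances
        members = np.flatnonzero(covers[best] & uncovered)
        balls.append((best, float(radius[best]), members.tolist()))
        uncovered[members] = False
    return balls

# Toy 1-D data: groups A and B interleave around positions 3 and 4.
X = [[0], [1], [2], [3], [4], [5], [6]]
groups = ["A", "A", "A", "B", "A", "B", "B"]
balls = greedy_ball_cover(X, groups)
for centre, r, members in balls:
    print(f"centre={centre} radius={r} covers={members}")
```

In this toy example the interleaved instances end up in small balls that cover only themselves, which hints at how ball attributes such as radius and number of covered instances can flag overlap areas.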

The authors report that Fair-ONB improves fairness metrics such as disparate impact and demographic parity on several benchmark datasets, with low impact on predictive performance compared to the original unmitigated models. This makes it a promising approach for practical bias mitigation in real-world machine learning applications.
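For reference, the two fairness metrics named here are commonly computed from a classifier's predictions as follows. This sketch uses the standard definitions (a ratio and a difference of positive prediction rates between two protected groups); the function names and toy predictions are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def positive_rate(y_pred, protected, value):
    """Share of positive predictions among instances with the given protected value."""
    y_pred = np.asarray(y_pred)
    mask = np.asarray(protected) == value
    return float(y_pred[mask].mean())

def disparate_impact(y_pred, protected, unprivileged, privileged):
    """Ratio of positive rates (unprivileged / privileged); 1.0 means parity."""
    return (positive_rate(y_pred, protected, unprivileged)
            / positive_rate(y_pred, protected, privileged))

def demographic_parity_difference(y_pred, protected, unprivileged, privileged):
    """Difference of positive rates (unprivileged - privileged); 0.0 means parity."""
    return (positive_rate(y_pred, protected, unprivileged)
            - positive_rate(y_pred, protected, privileged))

# Toy predictions (illustrative only): "u" = unprivileged, "p" = privileged.
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]
protected = ["u", "u", "u", "u", "p", "p", "p", "p"]

di = disparate_impact(y_pred, protected, "u", "p")
dp = demographic_parity_difference(y_pred, protected, "u", "p")
print(f"disparate impact: {di:.3f}")
print(f"demographic parity difference: {dp:.3f}")
```

Note that the ratio is undefined when the privileged group receives no positive predictions; a real evaluation would need to handle that case.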

Critical Analysis

The Fair-ONB paper presents a novel and compelling method for reducing unfair biases in machine learning models. The key strength is its ability to achieve bias mitigation without significant loss of predictive performance, which is a common challenge with many existing fairness techniques.

That said, the paper does not extensively explore the limitations or potential downsides of the approach. For example, it is unclear how robust Fair-ONB is to different types of bias, dataset shifts, or model architectures. It is also unclear how the ball-based selection criteria, such as radius or density, should be set in practice, which could significantly affect the method's effectiveness.

Additionally, the proposed technique is still a form of data preprocessing, which can have downstream effects that are not fully predictable. There may be cases where blindly removing data points, even if they are redundant, could inadvertently eliminate important information or minority signals that the model needs to learn fair representations.

Further research is needed to better understand the broader implications, edge cases, and failure modes of the Fair-ONB approach. Nonetheless, it represents an important step forward in the quest for more trustworthy and fair AI systems.

Conclusion

The Fair Overlap Number of Balls (Fair-ONB) method proposed in this paper offers a novel and promising approach for reducing unfair biases in machine learning models. By intelligently undersampling the training data based on its geometric structure, the technique can effectively mitigate demographic disparities without significantly compromising overall model performance.

As machine learning systems become increasingly pervasive, fairness-aware techniques like Fair-ONB will be crucial to ensuring these technologies are deployed equitably and benefit all members of society.

Related Papers

Fair Overlap Number of Balls (Fair-ONB): A Data-Morphology-based Undersampling Method for Bias Reduction

José Daniel Pascual-Triana, Alberto Fernández, Paulo Novais, Francisco Herrera

Given the magnitude of data generation currently, both in quantity and speed, the use of machine learning is increasingly important. When data include protected features that might give rise to discrimination, special care must be taken. Data quality is critical in these cases, as biases in training data can be reflected in classification models. This has devastating consequences and fails to comply with current regulations. Data-Centric Artificial Intelligence proposes dataset modifications to improve its quality. Instance selection via undersampling can foster balanced learning of classes and protected feature values in the classifier. When such undersampling is done close to the decision boundary, the effect on the classifier would be bolstered. This work proposes Fair Overlap Number of Balls (Fair-ONB), an undersampling method that harnesses the data morphology of the different data groups (obtained from the combination of classes and protected feature values) to perform guided undersampling in the areas where they overlap. It employs attributes of the ball coverage of the groups, such as the radius, number of covered instances and density, to select the most suitable areas for undersampling and reduce bias. Results show that the Fair-ONB method reduces bias with low impact on the classifier's predictive performance.

7/22/2024

Overlap Number of Balls Model-Agnostic CounterFactuals (ONB-MACF): A Data-Morphology-based Counterfactual Generation Method for Trustworthy Artificial Intelligence

José Daniel Pascual-Triana (Andalusian Institute of Data Science and Computational Intelligence), Alberto Fernández (Andalusian Institute of Data Science and Computational Intelligence), Javier Del Ser (Andalusian Institute of Data Science and Computational Intelligence, University of the Basque Country), Francisco Herrera (Andalusian Institute of Data Science and Computational Intelligence)

Explainable Artificial Intelligence (XAI) is a pivotal research domain aimed at understanding the operational mechanisms of AI systems, particularly those considered "black boxes" due to their complex, opaque nature. XAI seeks to make these AI systems more understandable and trustworthy, providing insight into their decision-making processes. By producing clear and comprehensible explanations, XAI enables users, practitioners, and stakeholders to trust a model's decisions. This work analyses the value of data morphology strategies in generating counterfactual explanations. It introduces the Overlap Number of Balls Model-Agnostic CounterFactuals (ONB-MACF) method, a model-agnostic counterfactual generator that leverages data morphology to estimate a model's decision boundaries. The ONB-MACF method constructs hyperspheres in the data space whose covered points share a class, mapping the decision boundary. Counterfactuals are then generated by incrementally adjusting an instance's attributes towards the nearest alternate-class hypersphere, crossing the decision boundary with minimal modifications. By design, the ONB-MACF method generates feasible and sparse counterfactuals that follow the data distribution. Our comprehensive benchmark from a double perspective (quantitative and qualitative) shows that the ONB-MACF method outperforms existing state-of-the-art counterfactual generation methods across multiple quality metrics on diverse tabular datasets. This supports our hypothesis, showcasing the potential of data-morphology-based explainability strategies for trustworthy AI.

5/22/2024

Minimum Enclosing Ball Synthetic Minority Oversampling Technique from a Geometric Perspective

Yi-Yang Shangguan, Shi-Shun Chen, Xiao-Yang Li

Class imbalance refers to the significant difference in the number of samples from different classes within a dataset, making it challenging to identify minority class samples correctly. This issue is prevalent in real-world classification tasks, such as software defect prediction, medical diagnosis, and fraud detection. The synthetic minority oversampling technique (SMOTE), which is based on interpolation between randomly selected minority class samples and their neighbors, is widely used to address the class imbalance issue. However, traditional SMOTE and most of its variants only interpolate between existing samples, which may be affected by noise samples in some cases and synthesize samples that lack diversity. To overcome these shortcomings, this paper proposes the Minimum Enclosing Ball SMOTE (MEB-SMOTE) method from a geometry perspective. Specifically, MEB is innovatively introduced into the oversampling method to construct a representative point. Then, high-quality samples are synthesized by interpolation between this representative point and the existing samples. The rationale behind constructing a representative point is discussed, demonstrating that the center of MEB is more suitable as the representative point. To exhibit the superiority of MEB-SMOTE, experiments are conducted on 15 real-world imbalanced datasets. The results indicate that MEB-SMOTE can effectively improve the classification performance on imbalanced datasets.

8/9/2024

Counterfactual Fairness through Transforming Data Orthogonal to Bias

Shuyi Chen, Shixiang Zhu

Machine learning models have shown exceptional prowess in solving complex issues across various domains. However, these models can sometimes exhibit biased decision-making, resulting in unequal treatment of different groups. Despite substantial research on counterfactual fairness, methods to reduce the impact of multivariate and continuous sensitive variables on decision-making outcomes are still underdeveloped. We propose a novel data pre-processing algorithm, Orthogonal to Bias (OB), which is designed to eliminate the influence of a group of continuous sensitive variables, thus promoting counterfactual fairness in machine learning applications. Our approach, based on the assumption of a jointly normal distribution within a structural causal model (SCM), demonstrates that counterfactual fairness can be achieved by ensuring the data is orthogonal to the observed sensitive variables. The OB algorithm is model-agnostic, making it applicable to a wide range of machine learning models and tasks. Additionally, it includes a sparse variant to improve numerical stability through regularization. Empirical evaluations on both simulated and real-world datasets, encompassing settings with both discrete and continuous sensitive variables, show that our methodology effectively promotes fairer outcomes without compromising accuracy.

7/2/2024