Leveraging Ontologies to Document Bias in Data

Read original: arXiv:2407.00509 - Published 8/13/2024 by Mayra Russo, Maria-Esther Vidal

Overview

  • This paper explores the use of ontologies to document bias in datasets used for machine learning.
  • The authors propose a framework for leveraging ontologies to identify and characterize biases in data, which can then be used to mitigate these biases in machine learning models.
  • The paper discusses the importance of understanding and addressing data bias, and how ontologies can provide a structured way to capture and reason about bias in datasets.

Plain English Explanation

The paper is about using a specific type of knowledge representation called an "ontology" to help identify and understand biases in the data used to train machine learning models. Bias in data is a big problem in AI, as it can lead to models that make unfair or inaccurate predictions.

The authors suggest that by creating ontologies that capture information about the different types of bias that can exist in data, we can better understand and document these biases. This can then help us take steps to mitigate the biases and build more fair and reliable AI systems.

Ontologies are like structured dictionaries that define the key concepts in a domain and the relationships between them. The authors propose using ontologies to formally describe the different ways data can be biased, such as demographic biases, historical biases, or cognitive biases introduced by the people annotating the data.
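To make the "structured dictionary" idea concrete, here is a minimal sketch of an ontology stored as subject-predicate-object triples, with a small helper that follows "subClassOf" links. The class names and edges are assumptions chosen for illustration; they are not taken from the paper's ontology.

```python
# Illustrative only: the class names and "subClassOf" edges below are
# assumptions for this sketch, not terms from the paper's ontology.
TRIPLES = {
    ("DemographicBias", "subClassOf", "DataBias"),
    ("HistoricalBias", "subClassOf", "DataBias"),
    ("AnnotatorCognitiveBias", "subClassOf", "DataBias"),
    ("GenderSkew", "subClassOf", "DemographicBias"),
}


def superclasses(concept, triples=TRIPLES):
    """Return every ancestor of `concept` by following subClassOf edges."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if subj == current and pred == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def is_a(concept, ancestor, triples=TRIPLES):
    """True if `concept` equals `ancestor` or is a transitive subclass of it."""
    return concept == ancestor or ancestor in superclasses(concept, triples)


print(sorted(superclasses("GenderSkew")))  # ['DataBias', 'DemographicBias']
```

Because the relationships are explicit, a query such as is_a("GenderSkew", "DataBias") can be answered by traversal rather than by string matching on labels, which is the kind of reasoning the summary attributes to ontologies.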

By having this structured way to capture and reason about biases, the researchers believe we can better understand the limitations of our datasets and the potential issues that may arise when using them to train machine learning models. This can lead to more transparent and accountable AI systems.

Technical Explanation

The paper proposes a framework for leveraging ontologies to document bias in data used for machine learning. The authors argue that ontologies can provide a structured way to capture and reason about different types of biases that may exist in datasets.

The framework consists of three key components:

  1. Bias Ontology: The authors develop an ontology that defines the different categories of bias that can occur in data, such as demographic bias, historical bias, and cognitive bias introduced by human annotators.

  2. Bias Detection: The ontology is used to guide the process of detecting biases in a given dataset. This involves analyzing the dataset's content, metadata, and the processes used to collect and annotate the data.

  3. Bias Characterization: The detected biases are then characterized using the concepts and relationships defined in the ontology. This provides a structured way to document the biases and understand their potential impact on machine learning models.
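The detection and characterization steps above can be sketched as follows. This is a hedged illustration, not the authors' method: the under-representation check, the 0.2 threshold, the BiasRecord fields, and the "GeographicBias" label are all assumptions introduced for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BiasRecord:
    """One documented bias, tagged with a (hypothetical) ontology concept."""
    concept: str    # ontology label, e.g. "GeographicBias" (an assumption)
    attribute: str  # dataset column the finding refers to
    group: str      # under-represented value of that column
    share: float    # observed fraction of rows with that value


def detect_underrepresentation(rows, attribute, concept, min_share=0.2):
    """Flag every value of `attribute` whose share of rows is below `min_share`."""
    counts = Counter(row[attribute] for row in rows)
    total = sum(counts.values())
    return [
        BiasRecord(concept, attribute, group, count / total)
        for group, count in sorted(counts.items())
        if count / total < min_share
    ]


# Toy metadata: 7 rows from Europe, 2 from Asia, 1 from Africa.
rows = (
    [{"region": "Europe"}] * 7
    + [{"region": "Asia"}] * 2
    + [{"region": "Africa"}] * 1
)
records = detect_underrepresentation(rows, "region", "GeographicBias")
for record in records:
    print(f"{record.concept}: {record.attribute}={record.group} ({record.share:.0%})")
# prints: GeographicBias: region=Africa (10%)
```

Each finding is emitted as a structured record that names the ontology concept it instantiates, so the documentation can be queried or aggregated later instead of living in free-text notes.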

The authors demonstrate the application of their framework using a case study on a dataset of animal images. They show how the bias ontology can be used to identify and characterize various biases in the dataset, such as geographic and demographic skews.

Critical Analysis

The paper presents a promising approach to addressing the critical issue of data bias in machine learning. By leveraging ontologies, the authors provide a structured and systematic way to document biases, which is an important step towards mitigating their negative impacts.

One potential limitation of the proposed framework is that it relies on the completeness and accuracy of the bias ontology. The authors acknowledge that the ontology they developed may not capture all possible types of biases, and further research is needed to refine and expand it.

Additionally, the process of detecting and characterizing biases using the ontology may require significant human effort and domain expertise. Automating these tasks, or developing more user-friendly tools, could help make the framework more accessible to a broader range of practitioners.

Overall, the paper makes a valuable contribution to the ongoing efforts to address bias in AI systems. The use of ontologies represents a promising approach that merits further exploration and refinement.

Conclusion

This paper presents a framework for leveraging ontologies to document bias in data used for machine learning. The authors argue that ontologies can provide a structured way to capture and reason about different types of biases, which is crucial for understanding and mitigating the negative impacts of biases in AI systems.

The proposed framework involves developing a bias ontology, using it to detect biases in datasets, and then characterizing those biases. The authors demonstrate the application of their approach through a case study, showing how the bias ontology can be used to identify and describe various biases in a dataset of animal images.

While the paper presents a promising approach, the authors acknowledge the need for further research to refine and expand the bias ontology, as well as to develop more automated tools for detecting and characterizing biases. Nonetheless, the use of ontologies represents an important step towards more transparent and accountable AI systems.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Leveraging Ontologies to Document Bias in Data

Mayra Russo, Maria-Esther Vidal

Machine Learning (ML) systems are capable of reproducing and often amplifying undesired biases. This puts emphasis on the importance of operating under practices that enable the study and understanding of the intrinsic characteristics of ML pipelines, prompting the emergence of documentation frameworks with the idea that "any remedy for bias starts with awareness of its existence". However, a resource that can formally describe these pipelines in terms of biases detected is still amiss. To fill this gap, we present the Doc-BiasO ontology, a resource that aims to create an integrated vocabulary of biases defined in the fair-ML literature and their measures, as well as to incorporate relevant terminology and the relationships between them. Overseeing ontology engineering best practices, we re-use existing vocabulary on machine learning and AI, to foster knowledge sharing and interoperability between the actors concerned with its research, development, regulation, among others. Overall, our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI and to improve the interpretation of bias in data and downstream impact.

8/13/2024

Reducing Biases towards Minoritized Populations in Medical Curricular Content via Artificial Intelligence for Fairer Health Outcomes

Chiman Salavati, Shannon Song, Willmar Sosa Diaz, Scott A. Hale, Roberto E. Montenegro, Fabricio Murai, Shiri Dori-Hacohen

Biased information (recently termed bisinformation) continues to be taught in medical curricula, often long after having been debunked. In this paper, we introduce BRICC, a first-in-class initiative that seeks to mitigate medical bisinformation using machine learning to systematically identify and flag text with potential biases, for subsequent review in an expert-in-the-loop fashion, thus greatly accelerating an otherwise labor-intensive process. A gold-standard BRICC dataset was developed throughout several years, and contains over 12K pages of instructional materials. Medical experts meticulously annotated these documents for bias according to comprehensive coding guidelines, emphasizing gender, sex, age, geography, ethnicity, and race. Using this labeled dataset, we trained, validated, and tested medical bias classifiers. We test three classifier approaches: a binary type-specific classifier, a general bias classifier; an ensemble combining bias type-specific classifiers independently-trained; and a multitask learning (MTL) model tasked with predicting both general and type-specific biases. While MTL led to some improvement on race bias detection in terms of F1-score, it did not outperform binary classifiers trained specifically on each task. On general bias detection, the binary classifier achieves up to 0.923 of AUC, a 27.8% improvement over the baseline. This work lays the foundations for debiasing medical curricula by exploring a novel dataset and evaluating different training model strategies. Hence, it offers new pathways for more nuanced and effective mitigation of bisinformation.

7/18/2024

Bias and Fairness in Large Language Models: A Survey

Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Nesreen K. Ahmed

Rapid advancements of large language models (LLMs) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. Despite this success, these models can learn, perpetuate, and amplify harmful social biases. In this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for LLMs. We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for LLMs. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. Our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. Our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. Our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. Finally, we identify open problems and challenges for future work. Synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in LLMs.

7/16/2024

A Study on Bias Detection and Classification in Natural Language Processing

Ana Sofia Evans, Helena Moniz, Luísa Coheur

Human biases have been shown to influence the performance of models and algorithms in various fields, including Natural Language Processing. While the study of this phenomenon is garnering focus in recent years, the available resources are still relatively scarce, often focusing on different forms or manifestations of biases. The aim of our work is twofold: 1) gather publicly-available datasets and determine how to better combine them to effectively train models in the task of hate speech detection and classification; 2) analyse the main issues with these datasets, such as scarcity, skewed resources, and reliance on non-persistent data. We discuss these issues in tandem with the development of our experiments, in which we show that the combinations of different datasets greatly impact the models' performance.

8/15/2024