Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?

Read original: arXiv:2408.14141 - Published 8/27/2024 by Urja Khurana, Eric Nalisnick, Antske Fokkens, Swabha Swayamdipta

Overview

The paper examines how annotator disagreement can inform calibration in subjective tasks such as hate speech detection and natural language inference.
It uses "soft labels" - probability distributions over the labels that annotators assigned - to represent the inherent subjectivity in these tasks, rather than a single majority-vote label.
The research investigates how this information can inform whether a model should abstain from a decision, a setting known as selective prediction.

Plain English Explanation

Hate speech detection is a challenging task because it involves subjective judgments. Different people may have different opinions on whether a particular piece of content is hateful or not. The paper explores how capturing this disagreement between annotators can provide useful information for training better hate speech detection models.

Instead of collapsing the annotators' votes into a single "yes" or "no" label by majority vote, the approach keeps "soft labels" - a probability distribution over the labels that represents the full range of annotator opinions. This allows the model to learn not just the majority label, but also the uncertainty around it.
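One way a soft label can be built from several annotators' votes is sketched below. This is a minimal illustration, not the paper's code: the label set, the votes, and the function names `soft_label` and `majority_label` are all hypothetical, and simple vote normalization is assumed as the aggregation rule.

```python
from collections import Counter

CLASSES = ["hateful", "not_hateful"]  # hypothetical label set


def soft_label(votes, classes):
    """Convert one example's annotator votes into a distribution over classes."""
    if not votes:
        raise ValueError("need at least one annotator vote")
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def majority_label(votes):
    """Collapse the votes into the single most frequent label."""
    return Counter(votes).most_common(1)[0][0]


# Made-up votes from five annotators for one example.
votes = ["hateful", "not_hateful", "hateful", "hateful", "not_hateful"]
print(soft_label(votes, CLASSES))  # [0.6, 0.4]
print(majority_label(votes))       # hateful
```

The majority label hides the fact that two of five annotators disagreed; the soft label keeps that information.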

By using this information about annotator disagreement, the researchers aim to improve the calibration of hate speech detection models. Calibration refers to how well a model's predicted probabilities match how often its predictions turn out to be correct: among predictions made with 80% confidence, about 80% should be right. Well-calibrated models produce more reliable and trustworthy outputs.
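To make the notion of calibration concrete, here is a small sketch of expected calibration error (ECE), a widely used binned calibration metric. It is shown only to illustrate the definition; this summary does not say which metrics the paper reports, and the function name, bin count, and data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# Made-up predictions: the top-class confidence and whether it was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False, True, False]
print(expected_calibration_error(confidences, correct))
```

A value of 0 means confidence and accuracy agree in every bin; larger values indicate over- or under-confidence.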

The key idea is that the soft labels capture the inherent subjectivity of hate speech, which can then be leveraged to train models that are better at handling this type of ambiguity and producing more reliable predictions.

Technical Explanation

The paper proposes a framework called "Crowd-Calibrator" that uses soft labels derived from annotator disagreement to improve model calibration in subjective tasks like hate speech detection.

In the hate speech detection task, instead of reducing each example to a binary majority label ("hateful" or "not hateful"), the researchers use probabilistic "soft labels" that capture the full distribution of annotator responses. This allows the model to learn not just the majority label, but also the uncertainty around it.
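A common way to let a model learn from a distribution rather than a single label is to use cross-entropy against the soft target. The sketch below assumes that loss purely for illustration; whether the paper uses exactly this objective is not stated here, and `soft_cross_entropy` and the numbers are hypothetical.

```python
import math


def soft_cross_entropy(model_probs, soft_target, eps=1e-12):
    """Cross-entropy of the model's distribution against a soft target."""
    if len(model_probs) != len(soft_target):
        raise ValueError("distributions must have the same length")
    # eps guards against log(0) when the model assigns zero probability.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(model_probs, soft_target))


soft_target = [0.6, 0.4]      # made-up crowd distribution for one example
overconfident = [0.99, 0.01]  # model ignores the disagreement
matched = [0.6, 0.4]          # model mirrors the crowd

print(soft_cross_entropy(overconfident, soft_target))
print(soft_cross_entropy(matched, soft_target))
```

Under this loss, an overconfident prediction on a contested example is penalized more than a prediction that mirrors the annotators' split.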

The Crowd-Calibrator framework then uses this additional information about annotator disagreement in two ways:

Calibration: The distance between the distribution of crowd worker labels and the model's own distribution over labels is used to inform whether the model should abstain from a decision, so that confidence reflects crowd agreement rather than only the model's perspective.
Training: The soft labels are incorporated into the training process, allowing the model to learn from the full range of annotator perspectives rather than just the majority-vote label.
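The idea of abstaining when the model's distribution is far from the crowd's can be sketched as follows. The paper's abstract only says that the method models the distance between the two distributions; the choice of total variation distance, the threshold of 0.2, the function names, and the assumption that a crowd distribution (or an estimate of it) is available at prediction time are all illustrative assumptions.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def predict_or_abstain(model_probs, crowd_probs, classes, threshold=0.2):
    """Return the model's top class, or None (abstain) if it strays from the crowd."""
    if total_variation(model_probs, crowd_probs) > threshold:
        return None
    best = max(range(len(classes)), key=lambda i: model_probs[i])
    return classes[best]


CLASSES = ["hateful", "not_hateful"]  # hypothetical label set
crowd = [0.6, 0.4]                     # made-up crowd distribution

print(predict_or_abstain([0.95, 0.05], crowd, CLASSES))  # None
print(predict_or_abstain([0.70, 0.30], crowd, CLASSES))  # hateful
```

In the first call the model is far more certain than the annotators, so the sketch abstains; in the second the two distributions are close and the prediction is returned.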

The researchers evaluated their approach on two highly subjective tasks, hate speech detection and natural language inference, and found that Crowd-Calibrator either outperforms or achieves competitive performance with existing selective prediction baselines. They also observed that the benefits were particularly pronounced for more subjective or ambiguous examples, where annotator disagreement was higher.
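Selective prediction systems are typically judged on how often they answer and how accurate they are when they do. The sketch below computes those two quantities for a list of predictions in which `None` marks an abstention. The function name and data are hypothetical, and the paper's exact evaluation metrics may differ.

```python
def coverage_and_selective_accuracy(predictions, gold):
    """Return (fraction answered, accuracy on answered); accuracy is None if all abstain."""
    if len(predictions) != len(gold) or not predictions:
        raise ValueError("need two non-empty lists of equal length")
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(predictions)
    if not answered:
        return coverage, None
    accuracy = sum(1 for p, g in answered if p == g) / len(answered)
    return coverage, accuracy


# Made-up predictions (None = abstained) and reference labels.
predictions = ["hateful", None, "not_hateful", "hateful"]
gold = ["hateful", "not_hateful", "not_hateful", "not_hateful"]

coverage, accuracy = coverage_and_selective_accuracy(predictions, gold)
print(coverage)  # 0.75
print(accuracy)
```

A useful abstention rule raises accuracy on the answered examples without reducing coverage more than necessary.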

Critical Analysis

The paper makes a compelling case for leveraging annotator disagreement to improve model calibration and performance in subjective tasks. The use of soft labels to capture the full range of annotator perspectives is a valuable approach that could have applications beyond hate speech detection.

However, the paper does not address some potential limitations and challenges:

The soft label approach may be more resource-intensive, as it requires collecting multiple annotations per example instead of a single binary label.
The paper evaluates on two tasks (hate speech detection and natural language inference), and it's unclear how well the Crowd-Calibrator framework would generalize to other subjective tasks with different types of ambiguity or annotator biases.
The paper does not explore the impact of different aggregation methods for deriving soft labels from the individual annotations, which could potentially affect the results.

Overall, the Crowd-Calibrator approach is a promising direction for improving model calibration and handling subjectivity in various AI applications. Further research exploring its limitations and broader applicability would be valuable.

Conclusion

This paper presents a novel framework called Crowd-Calibrator that leverages annotator disagreement to improve model calibration and performance in subjective tasks like hate speech detection. By using probabilistic "soft labels" to capture the full range of annotator perspectives, the approach allows models to learn from the inherent subjectivity in the task.

On hate speech detection and natural language inference, the approach either outperforms or is competitive with existing selective prediction baselines, with benefits reported particularly for ambiguous or subjective examples where annotator disagreement is higher. The Crowd-Calibrator framework could have broader implications for improving the reliability and trustworthiness of AI systems in a wide range of subjective domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?

Urja Khurana, Eric Nalisnick, Antske Fokkens, Swabha Swayamdipta

Subjective tasks in NLP have been mostly relegated to objective standards, where the gold label is decided by taking the majority vote. This obfuscates annotator disagreement and the inherent uncertainty of the label. We argue that subjectivity should factor into model decisions and play a direct role via calibration under a selective prediction setting. Specifically, instead of calibrating confidence purely from the model's perspective, we calibrate models for subjective tasks based on crowd worker agreement. Our method, Crowd-Calibrator, models the distance between the distribution of crowd worker labels and the model's own distribution over labels to inform whether the model should abstain from a decision. On two highly subjective tasks, hate speech detection and natural language inference, our experiments show Crowd-Calibrator either outperforms or achieves competitive performance with existing selective prediction baselines. Our findings highlight the value of bringing human decision-making into model predictions.

8/27/2024

Noise Correction on Subjective Datasets

Uthman Jinadu, Yi Ding

Incorporating every annotator's perspective is crucial for unbiased data modeling. Annotator fatigue and changing opinions over time can distort dataset annotations. To combat this, we propose to learn a more accurate representation of diverse opinions by utilizing multitask learning in conjunction with loss-based label correction. We show that using our novel formulation, we can cleanly separate agreeing and disagreeing annotations. Furthermore, this method provides a controllable way to encourage or discourage disagreement. We demonstrate that this modification can improve prediction performance in a single or multi-annotator setting. Lastly, we show that this method remains robust to additional label noise that is applied to subjective data.

6/5/2024

Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks

Negar Mokhberian, Myrl G. Marmarelis, Frederic R. Hopp, Valerio Basile, Fred Morstatter, Kristina Lerman

Supervised classification heavily depends on datasets annotated by humans. However, in subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters. Annotations have commonly been aggregated by employing methods like majority voting to determine a single ground truth label. In subjective tasks, aggregating labels will result in biased labeling and, consequently, biased models that can overlook minority opinions. Previous studies have shed light on the pitfalls of label aggregation and have introduced a handful of practical approaches to tackle this issue. Recently proposed multi-annotator models, which predict labels individually per annotator, are vulnerable to under-determination for annotators with few samples. This problem is exacerbated in crowdsourced datasets. In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks. Our approach involves learning representations of annotators, allowing for exploration of annotation behaviors. We show the improvement of our method on metrics that assess the performance on capturing individual annotators' perspectives. Additionally, we demonstrate fairness metrics to evaluate our model's equability of performance for marginalized annotators compared to others.

5/17/2024

Cost-Efficient Subjective Task Annotation and Modeling through Few-Shot Annotator Adaptation

Preni Golazizian, Alireza S. Ziabari, Ali Omrani, Morteza Dehghani

In subjective NLP tasks, where a single ground truth does not exist, the inclusion of diverse annotators becomes crucial as their unique perspectives significantly influence the annotations. In realistic scenarios, the annotation budget often becomes the main determinant of the number of perspectives (i.e., annotators) included in the data and subsequent modeling. We introduce a novel framework for annotation collection and modeling in subjective tasks that aims to minimize the annotation budget while maximizing the predictive performance for each annotator. Our framework has a two-stage design: first, we rely on a small set of annotators to build a multitask model, and second, we augment the model for a new perspective by strategically annotating a few samples per annotator. To test our framework at scale, we introduce and release a unique dataset, Moral Foundations Subjective Corpus, of 2000 Reddit posts annotated by 24 annotators for moral sentiment. We demonstrate that our framework surpasses the previous SOTA in capturing the annotators' individual perspectives with as little as 25% of the original annotation budget on two datasets. Furthermore, our framework results in more equitable models, reducing the performance disparity among annotators.

9/6/2024