VariErr NLI: Separating Annotation Error from Human Label Variation

Read original: arXiv:2403.01931 - Published 6/7/2024 by Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Overview

• This paper introduces a new dataset, VariErr, that focuses on understanding the difference between human label variation and annotation errors in the context of natural language inference (NLI) tasks.
• The researchers propose a two-round annotation process where annotators first assign labels with explanations and then judge the validity of the label-explanation pairs.
• The paper evaluates the effectiveness of various automatic error detection (AED) methods and large language models (GPTs) in distinguishing between valid label variation and annotation errors.

Plain English Explanation

The paper explores the difference between human label variation and annotation errors in natural language processing (NLP) benchmarks. When people are asked to label the same piece of text, they may assign different labels for valid reasons, like differing interpretations. On the other hand, annotation errors occur when labels are assigned for invalid reasons.

These two issues - variation and errors - are common in NLP datasets, but past research has studied them separately. This paper aims to tease apart the error from the valid signal, especially in cases where the "right" answer is not black-and-white.

To do this, the researchers created a new dataset called VariErr, focused on the NLI (natural language inference) task in English. They had annotators not only assign labels, but also explain their reasoning. Then, in a second round, annotators judged whether each label-explanation pair was valid. This dataset provides a way to distinguish true label variation from annotation errors.

The paper also evaluates different AI methods, including state-of-the-art automatic error detection (AED) techniques and large language models (GPTs), to see how well they can identify errors versus valid variation. They found that the GPTs, especially the powerful GPT-4, perform better than the AED methods, but still fall short of human-level performance at this task.

Technical Explanation

The researchers introduce a new dataset called VariErr that focuses on the task of natural language inference (NLI) in English. NLI is a common NLP task in which the goal is to determine the relationship (entailment, contradiction, or neutral) between two text snippets, a premise and a hypothesis.
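To make the setup concrete, the following minimal sketch shows how a single NLI item labeled by several annotators might be represented, and how the label distribution exposes disagreement. The premise, hypothesis, labels, and the names `NLIItem` and `label_distribution` are invented for illustration and do not come from VariErr or MNLI.

```python
from collections import Counter
from dataclasses import dataclass, field

# The three standard NLI labels.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class NLIItem:
    premise: str
    hypothesis: str
    # One label per annotator (hypothetical example data).
    labels: list = field(default_factory=list)


def label_distribution(item):
    """Return the share of annotators that chose each label."""
    total = len(item.labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(item.labels)
    return {label: counts[label] / total for label in LABELS}


item = NLIItem(
    premise="A man is playing a guitar on stage.",
    hypothesis="A musician is performing for an audience.",
    labels=["entailment", "neutral", "neutral", "entailment"],
)
dist = label_distribution(item)
print(dist)
```

A split distribution like this one does not say whether the disagreement is valid variation or error; that is the question the validity judgments are meant to answer.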

To create VariErr, the team re-annotated 500 items from the existing MNLI dataset. They used a two-round annotation process:

1. Annotators assigned an NLI label to each item and wrote an explanation for it.
2. Annotators then judged the validity of each label-explanation pair from the first round.

This process resulted in 7,732 validity judgments on 1,933 explanations, providing a dataset that can distinguish between valid label variation and annotation errors.
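The two rounds can be pictured as a small data model. This is a sketch under stated assumptions, not the authors' code: the class and function names, the example records, and in particular the aggregation rule (a label counts as an error when none of its explanations is judged valid by a strict majority) are illustrative choices. The final lines only check the arithmetic of the reported totals.

```python
from dataclasses import dataclass, field


@dataclass
class Explanation:
    annotator: str
    label: str
    text: str
    # Round 2: one True/False validity judgment per judging annotator.
    judgments: list = field(default_factory=list)

    def is_validated(self):
        # Assumed rule: valid if a strict majority of judgments say so.
        return sum(self.judgments) * 2 > len(self.judgments)


def split_labels(explanations):
    """Return (plausible_labels, error_labels) for one item."""
    all_labels = {e.label for e in explanations}
    plausible = {e.label for e in explanations if e.is_validated()}
    return plausible, all_labels - plausible


# Hypothetical round-1 explanations with round-2 judgments attached.
explanations = [
    Explanation("A1", "entailment", "hypothetical reason 1", [True, True, True, False]),
    Explanation("A2", "neutral", "hypothetical reason 2", [True, True, False, True]),
    Explanation("A3", "contradiction", "hypothetical reason 3", [False, False, True, False]),
]
plausible, errors = split_labels(explanations)
print(sorted(plausible), sorted(errors))

# Arithmetic on the totals reported for VariErr.
judgments_total, explanations_total, items_total = 7732, 1933, 500
judgments_per_explanation = judgments_total / explanations_total
explanations_per_item = round(explanations_total / items_total, 2)
print(judgments_per_explanation, explanations_per_item)
```

Under this toy rule, two labels survive as plausible variation and one is flagged as an error. The reported totals work out to four validity judgments per explanation and just under four explanations per item on average.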

The researchers then evaluated the performance of various automatic error detection (AED) methods and large language models (GPTs) in identifying errors versus valid variation. They found that state-of-the-art AED methods significantly underperformed both the GPTs and humans. GPT-4 was the best system, but it still fell short of human performance on this task.
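One way to picture such a comparison is to treat error detection as a ranking problem: each method assigns every label an error score, and the resulting ranking is compared against the human validity judgments. The sketch below computes average precision for two made-up scorers. The metric choice, the scores, and the gold flags are assumptions for illustration; they are not results or code from the paper.

```python
def average_precision(scores, is_error):
    """Average precision of an error ranking (higher score = more suspicious)."""
    ranked = sorted(zip(scores, is_error), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, gold) in enumerate(ranked, start=1):
        if gold:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


# Hypothetical gold flags: True means humans judged the label an error.
gold = [True, False, False, True, False]

# Hypothetical error scores from two detectors for the same five labels.
scorer_a = [0.9, 0.2, 0.1, 0.8, 0.3]
scorer_b = [0.4, 0.9, 0.1, 0.3, 0.8]

ap_a = average_precision(scorer_a, gold)
ap_b = average_precision(scorer_b, gold)
print(f"scorer_a AP = {ap_a:.3f}, scorer_b AP = {ap_b:.3f}")
```

Here `scorer_a` ranks both true errors first, while `scorer_b` ranks two valid labels above them, so its average precision is lower. A ranking metric like this rewards a detector for surfacing real errors early without requiring a fixed decision threshold.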

Critical Analysis

The researchers acknowledge that their methodology and dataset are focused on the NLI task in English, and further work is needed to determine if the findings generalize to other NLP tasks and languages. Additionally, they note that their study did not explore the reasons behind the label variation, which could provide valuable insights.

One potential limitation is that the VariErr dataset may not fully capture the complexity of real-world NLP tasks, where the "right" answer is often more nuanced than a simple three-way classification. The researchers could consider expanding the dataset to include a wider range of NLP tasks and more ambiguous cases.

Furthermore, the paper does not delve into the specific shortcomings of the AED methods or the areas where the GPTs struggle. A more detailed analysis of the error types and their sources could help researchers develop more effective techniques for distinguishing valid variation from annotation errors.

Conclusion

This paper presents a novel approach to understanding the difference between human label variation and annotation errors in NLP benchmarks. The VariErr dataset and the comparative evaluation of AED methods and GPTs provide valuable insights for the development of more trustworthy and reliable NLP systems.

The findings suggest that existing AED techniques may not be sufficient for the nuanced task of differentiating valid variation from errors, and that large language models like GPT-4 can be more effective, although still imperfect. This work lays the groundwork for future research on improving the robustness and interpretability of NLP systems by better accounting for the signal and noise in human annotations.

Related Papers

VariErr NLI: Separating Annotation Error from Human Label Variation

Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

6/7/2024

Annotation Errors and NER: A Study with OntoNotes 5.0

Gabriel Bernier-Colborne, Sowmya Vajjala

Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.

6/28/2024

Unraveling the Dilemma of AI Errors: Exploring the Effectiveness of Human and Machine Explanations for Large Language Models

Marvin Pafla, Kate Larson, Mark Hancock

The field of eXplainable artificial intelligence (XAI) has produced a plethora of methods (e.g., saliency-maps) to gain insight into artificial intelligence (AI) models, and has exploded with the rise of deep learning (DL). However, human-participant studies question the efficacy of these methods, particularly when the AI output is wrong. In this study, we collected and analyzed 156 human-generated text and saliency-based explanations collected in a question-answering task (N=40) and compared them empirically to state-of-the-art XAI explanations (integrated gradients, conservative LRP, and ChatGPT) in a human-participant study (N=136). Our findings show that participants found human saliency maps to be more helpful in explaining AI answers than machine saliency maps, but performance negatively correlated with trust in the AI model and explanations. This finding hints at the dilemma of AI errors in explanation, where helpful explanations can lead to lower task performance when they support wrong AI predictions.

4/12/2024

CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English

Andrew Rueda, Elena Álvarez Mellado, Constantine Lignos

Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state-of-the-art has seemingly hit another plateau on the benchmark CoNLL-03 English dataset. In this paper, we perform a deep dive into the test outputs of the highest-performing NER models, conducting a fine-grained evaluation of their performance by introducing new document-level annotations on the test set. We go beyond F1 scores by categorizing errors in order to interpret the true state of the art for NER and guide future work. We review previous attempts at correcting the various flaws of the test set and introduce CoNLL#, a new corrected version of the test set that addresses its systematic and most prevalent errors, allowing for low-noise, interpretable error analysis.

5/21/2024