NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition

Read original: arXiv:2405.07609 - Published 5/14/2024 by Elena Merdjanovska, Ansar Aynetdinov, Alan Akbik

Overview

Addresses the challenge of label noise in named entity recognition (NER) datasets
Introduces a new NER benchmark, NoiseBench, with 6 types of real-world label noise
Finds that real-world noise is significantly more challenging than simulated noise
Reveals that current state-of-the-art noise-robust models fall short of their theoretical limits

Plain English Explanation

Named entity recognition (NER) is the task of identifying and classifying important named entities like people, organizations, and locations in text. However, the datasets used to train NER models often contain a significant number of incorrect labels, where the entity type or boundary is wrong. This label noise can seriously degrade the performance of NER models.

Prior research has proposed various "noise-robust" learning methods that can learn from data with partially incorrect labels. But these methods are typically evaluated using simulated noise, where the labels in a clean dataset are automatically corrupted. As the paper shows, this kind of simulated noise is far easier to handle than real-world noise caused by human error or semi-automatic annotation.
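To make "simulated noise" concrete, here is a minimal sketch of one simple way to corrupt clean labels automatically: flipping each token tag to a random different tag with a fixed probability. This is an illustrative assumption, not the simulation procedure used in the paper. The function name corrupt_labels, the BIO tag set, and the toy sentence are all hypothetical.

```python
import random

def corrupt_labels(tags, label_set, noise_rate, seed=0):
    """Replace each tag with a different, uniformly sampled label with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy = []
    for tag in tags:
        if rng.random() < noise_rate:
            # Sample among the other labels so a flip always changes the tag
            noisy.append(rng.choice([label for label in label_set if label != tag]))
        else:
            noisy.append(tag)
    return noisy

# Toy five-token sentence with a person name and a location (hypothetical data)
clean = ["B-PER", "I-PER", "O", "O", "B-LOC"]
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
noisy = corrupt_labels(clean, labels, noise_rate=0.3)
```

Independent token flips like these can produce tag sequences such as an I- tag with no preceding B- tag, which is one intuitive reason this style of noise may look unlike the mistakes annotators actually make.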

To better understand the impact of real-world noise, the researchers introduce a new NER benchmark called NoiseBench. NoiseBench contains clean NER training data that has been corrupted with 6 different types of real-world noise, including errors made by experts, crowdsourcing workers, automatic annotation tools, and large language models.

The paper's analysis shows that this real-world noise is significantly more challenging than simulated noise. The researchers also find that current state-of-the-art noise-robust learning models fall far short of their theoretically achievable performance on NoiseBench.

These findings suggest that more work is needed to develop NER models that can effectively handle the types of noise found in real-world NER datasets. The NoiseBench benchmark provides a valuable tool for researchers to test and improve their noise-robust learning methods.

Technical Explanation

The paper addresses the problem of label noise in named entity recognition (NER) datasets. Training data for NER models often contains a significant percentage of incorrect labels, where the entity type or the entity boundary is wrong.
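The distinction between a wrong entity type and a wrong entity boundary can be sketched in code. The example below assumes BIO-tagged sentences and compares a noisy annotation against a clean one at the span level. The helper names (bio_to_spans, classify_errors) and the extra "spurious" and "missing" categories are assumptions made for illustration, not the paper's taxonomy.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, type) spans, end exclusive."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        # A stray I- tag with no open span is leniently treated as a span start
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def classify_errors(clean_tags, noisy_tags):
    """Count how noisy spans deviate from clean spans."""
    clean, noisy = bio_to_spans(clean_tags), bio_to_spans(noisy_tags)
    clean_bounds = {(s, e) for s, e, _ in clean}
    errors = {"type": 0, "boundary": 0, "spurious": 0, "missing": 0}
    for s, e, _ in noisy - clean:
        if (s, e) in clean_bounds:
            errors["type"] += 1        # same boundaries, different entity type
        elif any(s < ce and cs < e for cs, ce, _ in clean):
            errors["boundary"] += 1    # overlaps a clean entity, boundaries differ
        else:
            errors["spurious"] += 1    # no clean entity at this position
    # Clean entities that no noisy span overlaps at all
    errors["missing"] = sum(
        1 for cs, ce, _ in clean if not any(cs < e and s < ce for s, e, _ in noisy)
    )
    return errors

clean_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
noisy_tags = ["B-PER", "O", "O", "B-ORG", "O"]
result = classify_errors(clean_tags, noisy_tags)
```

In this toy sentence the noisy annotation cuts the person name short (a boundary error) and labels the location as an organization (a type error).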

To study the impact of this label noise, the researchers introduce a new NER benchmark called NoiseBench. NoiseBench consists of clean NER training data that has been corrupted with 6 types of real-world noise. The paper's abstract names four sources for these noise types:

Expert errors: Mistakes made by professional annotators
Crowdsourcing errors: Mistakes made by crowdsourced workers
Automatic annotation errors: Mistakes made by automated annotation tools
LLM errors: Mistakes made by large language models used for annotation
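Because the benchmark pairs one clean training set with several noisy versions of the same data, the amount of noise in each version can be measured by direct comparison. The sketch below computes a simple token-level disagreement rate. The metric, the function name token_noise_rate, the variant names, and the toy data are assumptions for illustration; the paper may quantify noise differently.

```python
def token_noise_rate(clean_sentences, noisy_sentences):
    """Fraction of tokens whose noisy tag differs from the clean tag."""
    total = wrong = 0
    for clean, noisy in zip(clean_sentences, noisy_sentences):
        if len(clean) != len(noisy):
            raise ValueError("clean and noisy sentences must be token-aligned")
        total += len(clean)
        wrong += sum(c != n for c, n in zip(clean, noisy))
    return wrong / total if total else 0.0

# Two toy sentences with clean tags (hypothetical data)
clean_train = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O", "O"]]

# Hypothetical noisy versions of the same sentences
noisy_variants = {
    "variant_a": [["B-PER", "O", "O", "B-LOC"], ["O", "B-ORG", "O", "O"]],
    "variant_b": [["B-PER", "I-PER", "O", "B-ORG"], ["O", "O", "O", "B-LOC"]],
}

rates = {
    name: token_noise_rate(clean_train, noisy)
    for name, noisy in noisy_variants.items()
}
```

Here variant_a disagrees with the clean tags on 1 of 8 tokens and variant_b on 3 of 8, so the same clean data yields training sets with different noise levels.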

The researchers evaluate several state-of-the-art noise-robust learning approaches, including methods that augment NER datasets with LLM-generated annotations and models that use a "mix of experts" approach, on the NoiseBench benchmark.

The analysis reveals that real-world noise is significantly more challenging than simulated noise, and that current noise-robust learning methods fall far short of their theoretically achievable upper bound on NoiseBench. The instance-dependent noisy label learning approach is shown to be the best-performing model, but there is still a large gap between its performance and the theoretical limit.
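The gap to an upper bound can be expressed as a difference between two scores on the same test set. The sketch below uses span-level F1, a common NER metric, and assumes spans are given as (sentence_id, start, end, type) tuples. This summary does not say how the paper computes its upper bound, so both prediction sets here are toy placeholders and span_f1 is a hypothetical helper name.

```python
def span_f1(gold_spans, predicted_spans):
    """Micro-averaged F1 over exact-match entity spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy gold annotations: (sentence_id, start, end, type), end exclusive
gold = {(0, 0, 2, "PER"), (0, 3, 4, "LOC")}

# Placeholder predictions from an upper-bound system and a noise-robust method
upper_bound_pred = {(0, 0, 2, "PER"), (0, 3, 4, "LOC")}
method_pred = {(0, 0, 2, "PER"), (0, 3, 4, "ORG")}

upper_bound_f1 = span_f1(gold, upper_bound_pred)
method_f1 = span_f1(gold, method_pred)
gap = upper_bound_f1 - method_f1
```

A span only counts as correct when both its boundaries and its type match, so the single type error in method_pred costs both precision and recall.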

Critical Analysis

The paper makes a valuable contribution by highlighting the significant gap between the performance of noise-robust learning methods on simulated noise and their performance on real-world noise. This is an important finding, as it suggests that the conclusions drawn from studies using simulated noise may not translate to real-world applications.

One limitation of the study is that it only considers 6 types of real-world noise. While these cover a range of common sources of label noise, there may be other types of noise that are not captured. Additionally, the paper does not provide detailed insights into why certain types of noise are more challenging than others.

Further research could explore the underlying reasons for the performance gap between simulated and real-world noise, as well as investigate additional types of real-world noise. It would also be valuable to see how the noise-robust learning methods perform on a wider range of NER datasets, not just the one used in this study.

Overall, the NoiseBench benchmark and the findings presented in this paper are an important step towards developing more robust and reliable NER models that can handle the complexities of real-world data.

Conclusion

This paper addresses the challenge of label noise in named entity recognition (NER) datasets and introduces a new benchmark, NoiseBench, to study the impact of various types of real-world noise. The analysis shows that real-world noise is significantly more challenging than simulated noise, and that current state-of-the-art noise-robust learning methods fall far short of their theoretical limits on the NoiseBench dataset.

These findings highlight the need for further research to develop more effective noise-robust learning approaches that can handle the types of label noise encountered in real-world NER applications. The NoiseBench benchmark provides a valuable tool for researchers to test and improve their methods, ultimately leading to more robust and reliable NER models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition

Elena Merdjanovska, Ansar Aynetdinov, Alan Akbik

Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors. We present an analysis that shows that real noise is significantly more challenging than simulated noise, and show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound. We release NoiseBench to the research community.

5/14/2024

AlleNoise -- large-scale text classification benchmark dataset with real-world label noise

Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.

7/17/2024

Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation

Chaoyi Ai, Yong Jiang, Shen Huang, Pengjun Xie, Kewei Tu

Named entity recognition (NER) models often struggle with noisy inputs, such as those with spelling mistakes or errors generated by Optical Character Recognition processes, and learning a robust NER model is challenging. Existing robust NER models utilize both noisy text and its corresponding gold text for training, which is infeasible in many real-world applications in which gold text is not available. In this paper, we consider a more realistic setting in which only noisy text and its NER labels are available. We propose to retrieve relevant text of the noisy text from a knowledge corpus and use it to enhance the representation of the original noisy input. We design three retrieval methods: sparse retrieval based on lexicon similarity, dense retrieval based on semantic similarity, and self-retrieval based on task-specific text. After retrieving relevant text, we concatenate the retrieved text with the original noisy text and encode them with a transformer network, utilizing self-attention to enhance the contextual token representations of the noisy text using the retrieved text. We further employ a multi-view training framework that improves robust NER without retrieving text during inference. Experiments show that our retrieval-augmented model achieves significant improvements in various noisy NER settings.

7/29/2024

Noisy Label Processing for Classification: A Survey

Mengting Li, Chuang Zhu

In recent years, deep neural networks (DNNs) have gained remarkable achievement in computer vision tasks, and the success of DNNs often depends greatly on the richness of data. However, the acquisition process of data and high-quality ground truth requires a lot of manpower and money. In the long, tedious process of data annotation, annotators are prone to make mistakes, resulting in incorrect labels of images, i.e., noisy labels. The emergence of noisy labels is inevitable. Moreover, since research shows that DNNs can easily fit noisy labels, the existence of noisy labels will cause significant damage to the model training process. Therefore, it is crucial to combat noisy labels for computer vision tasks, especially for classification tasks. In this survey, we first comprehensively review the evolution of different deep learning approaches for noisy label combating in the image classification task. In addition, we also review different noise patterns that have been proposed to design robust algorithms. Furthermore, we explore the inner pattern of real-world label noise and propose an algorithm to generate a synthetic label noise pattern guided by real-world data. We test the algorithm on the well-known real-world dataset CIFAR-10N to form a new real-world data-guided synthetic benchmark and evaluate some typical noise-robust methods on the benchmark.

4/8/2024