AlleNoise -- large-scale text classification benchmark dataset with real-world label noise

Read original: arXiv:2407.10992 - Published 7/17/2024 by Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko

Overview

This paper introduces AlleNoise, a new large-scale text classification benchmark dataset with real-world, instance-dependent label noise.
The dataset contains over 500,000 examples across approximately 5,600 classes, organized in a hierarchical taxonomy of categories, and provides human-verified clean labels alongside the noisy ones, making it a valuable benchmark for evaluating machine learning models' robustness to noisy labels.
The paper also evaluates a representative selection of established methods for learning with noisy labels and finds them inadequate for this kind of real-world noise.

Plain English Explanation

The researchers created a new dataset called AlleNoise to help test how well machine learning models can handle real-world "noisy" labels. In many real-world datasets, the labels (the information used to train the model) can be inaccurate or inconsistent. This can make it challenging for models to learn effectively.

The AlleNoise dataset has over 500,000 examples across roughly 5,600 categories. Its noisy labels come from mistakes made by actual users of a major e-commerce marketplace, and each one is paired with a human-verified clean label. This allows researchers to evaluate how well established methods for learning with noisy labels cope with realistic labeling errors.

By having a large, standardized dataset with known label noise, the researchers hope to accelerate progress in building more robust and reliable machine learning models that can perform well even when the training data is not perfectly clean.

Technical Explanation

The paper introduces AlleNoise, a large-scale text classification dataset designed to serve as a benchmark for evaluating machine learning models' ability to handle real-world label noise. The dataset contains over 500,000 examples spanning approximately 5,600 classes, complemented by a meaningful, hierarchical taxonomy of categories. Its label noise is instance-dependent: whether an example is mislabeled depends on the example itself rather than only on its class.

Unlike benchmarks with synthetic noise, the noise in AlleNoise is not injected artificially. The noisy labels come from actual users of a major e-commerce marketplace, so they reflect the semantics of genuine human mistakes. In addition to the noisy labels, the authors provide human-verified clean labels, which give deeper insight into the noise distribution than the web-scraped datasets typically used in the field.
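Since the paper's abstract states that the dataset provides human-verified clean labels alongside the noisy ones, the noise distribution can be inspected directly. The following is a minimal sketch of that idea, not the authors' code: the function names, the in-memory list format, and the toy category names are all assumptions made for illustration.

```python
from collections import Counter


def noise_rate(noisy_labels, clean_labels):
    """Fraction of examples whose noisy label differs from the clean label."""
    if len(noisy_labels) != len(clean_labels):
        raise ValueError("label lists must have the same length")
    if not noisy_labels:
        return 0.0
    mismatches = sum(n != c for n, c in zip(noisy_labels, clean_labels))
    return mismatches / len(noisy_labels)


def top_confusions(noisy_labels, clean_labels, k=3):
    """Most frequent (clean, noisy) label pairs among mislabeled examples."""
    pairs = Counter(
        (c, n) for n, c in zip(noisy_labels, clean_labels) if n != c
    )
    return pairs.most_common(k)


# Hypothetical toy data; these category names are invented for the example.
clean = ["phones", "phones", "cases", "chargers", "cases", "phones"]
noisy = ["phones", "cases", "cases", "chargers", "phones", "cases"]

rate = noise_rate(noisy, clean)
confusions = top_confusions(noisy, clean)
print(f"noise rate: {rate:.2f}")
print(confusions)
```

On real data, the list of most frequent confusions is where instance-dependent noise tends to show up: mistakes concentrate between semantically close categories instead of spreading uniformly over all classes.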

The paper then benchmarks a representative selection of established methods for learning with noisy labels. These approaches aim to help models learn effectively even when the training data contains inaccurate or inconsistent labels.

The researchers find that these methods are inadequate to handle the real-world noise in AlleNoise and present evidence that they do not alleviate excessive memorization of noisy labels. Related text benchmarks such as NoisyAG-News and NoiseBench report similar difficulties with real label noise, which suggests that the problem is not specific to one dataset.
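The abstract reports that established methods do not alleviate excessive memorization. One simple proxy for memorization, when clean labels are available, is to look only at the mislabeled training examples and count how often the model reproduces the wrong (noisy) label versus the correct (clean) one. The sketch below illustrates that proxy under stated assumptions; it is not necessarily the metric used in the paper, and the function name and toy labels are hypothetical.

```python
def memorization_stats(predictions, noisy_labels, clean_labels):
    """Summarize model behavior on mislabeled training examples.

    memorized: share of mislabeled examples where the model predicts
               the noisy (wrong) training label.
    corrected: share of mislabeled examples where the model predicts
               the clean (true) label instead.
    """
    mislabeled = [
        (p, n, c)
        for p, n, c in zip(predictions, noisy_labels, clean_labels)
        if n != c
    ]
    if not mislabeled:
        return {"memorized": 0.0, "corrected": 0.0}
    total = len(mislabeled)
    memorized = sum(p == n for p, n, _ in mislabeled) / total
    corrected = sum(p == c for p, _, c in mislabeled) / total
    return {"memorized": memorized, "corrected": corrected}


# Hypothetical toy labels and predictions on the training set.
clean = ["a", "a", "b", "c", "b"]
noisy = ["a", "b", "b", "a", "c"]
preds = ["a", "b", "b", "c", "a"]

stats = memorization_stats(preds, noisy, clean)
print(stats)
```

A method that handles noise well would be expected to push the memorized share down and the corrected share up over the course of training; the two shares need not sum to one, because the model can also predict a third class.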

Critical Analysis

The AlleNoise dataset represents a significant contribution to the field, as it provides a large-scale benchmark for evaluating machine learning models' robustness to label noise in text classification. Because the noise comes from real users and is accompanied by human-verified clean labels, the dataset allows researchers to study the impact of realistic noisy labels and test the effectiveness of different mitigation techniques.

One potential limitation of the dataset is that its noise comes from users of a single e-commerce marketplace. The patterns of human error it captures may therefore not fully reflect the nuances and idiosyncrasies of label noise in other text domains or annotation settings.

Additionally, the dataset is focused on text classification, which may limit its generalizability to other domains, such as image recognition or time series analysis. It would be valuable to see similar large-scale, noisy label datasets developed for a broader range of machine learning tasks.

Overall, the AlleNoise dataset and the techniques explored in the paper represent an important step forward in addressing the challenge of label noise in machine learning. By providing a robust benchmark and highlighting promising approaches, the research helps pave the way for the development of more reliable and trustworthy AI systems.

Conclusion

The AlleNoise dataset introduced in this paper is a valuable resource for the machine learning community, as it provides a large-scale benchmark for evaluating models' robustness to real-world label noise. By pairing noisy labels from real users with human-verified clean labels, the dataset allows researchers to better understand the impact of this challenge and to test established and new methods for learning with noisy labels.

The insights gained from this research could help improve the reliability of machine learning models in applications where label noise is common, such as text classification in e-commerce. By showing that established methods fall short on real-world noise, the paper sets a high bar for future label noise methods and moves the field closer to AI systems that can robustly handle the complexities and uncertainties of real-world data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AlleNoise -- large-scale text classification benchmark dataset with real-world label noise

Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.

7/17/2024

NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification

Hongfei Huang, Tingting Liang, Xixi Sun, Zikang Jin, Yuyu Yin

Existing research on learning with noisy labels predominantly focuses on synthetic label noise. Although synthetic noise possesses well-defined structural properties, it often fails to accurately replicate real-world noise patterns. In recent years, there has been a concerted effort to construct generalizable and controllable instance-dependent noise datasets for image classification, significantly advancing the development of noise-robust learning in this area. However, studies on noisy label learning for text classification remain scarce. To better understand label noise in real-world text classification settings, we constructed the benchmark dataset NoisyAG-News through manual annotation. Initially, we analyzed the annotated data to gather observations about real-world noise. We qualitatively and quantitatively demonstrated that real-world noisy labels adhere to instance-dependent patterns. Subsequently, we conducted comprehensive learning experiments on NoisyAG-News and its corresponding synthetic noise datasets using pre-trained language models and noise-handling techniques. Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise, with samples of varying confusion levels showing inconsistent performance during training and testing. These real-world noise patterns pose new, significant challenges, prompting a reevaluation of noisy label handling methods. We hope that NoisyAG-News will facilitate the development and evaluation of future solutions for learning with noisy labels.

7/10/2024

NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition

Elena Merdjanovska, Ansar Aynetdinov, Alan Akbik

Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors and LLM errors. We present an analysis that shows that real noise is significantly more challenging than simulated noise, and show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound. We release NoiseBench to the research community.

5/14/2024

Noisy Label Processing for Classification: A Survey

Mengting Li, Chuang Zhu

In recent years, deep neural networks (DNNs) have gained remarkable achievement in computer vision tasks, and the success of DNNs often depends greatly on the richness of data. However, the acquisition process of data and high-quality ground truth requires a lot of manpower and money. In the long, tedious process of data annotation, annotators are prone to make mistakes, resulting in incorrect labels of images, i.e., noisy labels. The emergence of noisy labels is inevitable. Moreover, since research shows that DNNs can easily fit noisy labels, the existence of noisy labels will cause significant damage to the model training process. Therefore, it is crucial to combat noisy labels for computer vision tasks, especially for classification tasks. In this survey, we first comprehensively review the evolution of different deep learning approaches for noisy label combating in the image classification task. In addition, we also review different noise patterns that have been proposed to design robust algorithms. Furthermore, we explore the inner pattern of real-world label noise and propose an algorithm to generate a synthetic label noise pattern guided by real-world data. We test the algorithm on the well-known real-world dataset CIFAR-10N to form a new real-world data-guided synthetic benchmark and evaluate some typical noise-robust methods on the benchmark.

4/8/2024