Extracting Clean and Balanced Subset for Noisy Long-tailed Classification

2404.06795

Published 4/11/2024 by Zhuo Li, He Zhao, Zhen Li, Tongliang Liu, Dandan Guo, Xiang Wan

Extracting Clean and Balanced Subset for Noisy Long-tailed Classification

Abstract

Real-world datasets usually are class-imbalanced and corrupted by label noise. To solve the joint issue of long-tailed distribution and label noise, most previous works usually aim to design a noise detector to distinguish the noisy and clean samples. Despite their effectiveness, they may be limited in handling the joint issue effectively in a unified way. In this work, we develop a novel pseudo labeling method using class prototypes from the perspective of distribution matching, which can be solved with optimal transport (OT). By setting a manually-specific probability measure and using a learned transport plan to pseudo-label the training samples, the proposed method can reduce the side-effects of noisy and long-tailed data simultaneously. Then we introduce a simple yet effective filter criteria by combining the observed labels and pseudo labels to obtain a more balanced and less noisy subset for a robust model training. Extensive experiments demonstrate that our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.

Create account to get full access

Overview

The paper proposes a method for extracting a clean and balanced subset from a noisy long-tailed dataset to improve classification performance.
The approach involves a two-stage process: 1) identifying clean samples using a noise detection model, and 2) balancing the class distribution by oversampling minority classes.
The method is evaluated on several benchmark datasets and shows improvements in classification accuracy compared to previous long-tailed learning techniques.

Plain English Explanation

In machine learning, datasets often have an uneven distribution of classes, with some classes having many more samples than others. This is known as a "long-tailed" distribution. Additionally, the data can be "noisy," meaning there are incorrect or irrelevant samples that can negatively impact model performance.

The researchers in this paper developed a way to address both of these challenges. First, they use a machine learning model to identify the "clean" samples in the dataset - those that are likely to be correctly labeled. Then, they balance the class distribution by creating more samples for the underrepresented classes, a process called oversampling.

By extracting this clean and balanced subset of the data, the researchers were able to train machine learning models that performed better on the overall classification task, compared to using the original long-tailed and noisy dataset. This is an important contribution, as it can help improve the accuracy and reliability of machine learning systems in real-world applications with messy, imbalanced data.

Technical Explanation

The paper proposes a two-stage framework called "Extracting Clean and Balanced Subset" (ECBS) for handling noisy long-tailed datasets. In the first stage, the authors use a noise detection model, such as Latent-based Diffusion Model for Long-tailed Recognition, to identify clean samples from the dataset. This model learns a latent representation that separates clean and noisy samples.

In the second stage, the authors balance the class distribution of the clean subset by oversampling the minority classes. This helps address the long-tailed nature of the data. The authors experiment with various oversampling techniques, including SPDOLLAR2DOLLAROT: Semantic Regularized Progressive Partial Optimal Transport and Three Heads are Better Than One: Complementary.

The authors evaluate their ECBS framework on several benchmark datasets, including Long-tailed Anomaly Detection with Learnable Class Names, and show that it outperforms previous long-tailed learning techniques in terms of classification accuracy.

Critical Analysis

The paper makes a valuable contribution by addressing the challenges of noisy and long-tailed datasets, which are common in real-world machine learning applications. The two-stage approach of cleaning the data and then balancing the class distribution is a well-designed and effective solution.

However, one potential limitation is the reliance on a separate noise detection model, which may not always be available or easy to train. Additionally, the oversampling techniques used in the second stage, while effective, may not be suitable for all types of data or tasks.

Further research could explore ways to integrate the noise detection and class balancing steps into a more unified framework, potentially using end-to-end learning approaches. Investigating the performance of the ECBS method on a wider range of datasets and tasks would also help assess its broader applicability.

Overall, this paper presents a well-thought-out and promising approach to handling noisy long-tailed datasets, and its findings could have significant implications for improving the performance of machine learning systems in real-world scenarios.

Conclusion

The "Extracting Clean and Balanced Subset" (ECBS) framework proposed in this paper offers a effective solution for improving classification accuracy on noisy long-tailed datasets. By first identifying clean samples and then balancing the class distribution, the method is able to overcome the challenges posed by messy, imbalanced data that are common in many real-world applications.

The promising results demonstrated on benchmark datasets suggest that the ECBS approach could have wide-ranging benefits for developing more robust and reliable machine learning systems. As the field continues to grapple with the complexities of real-world data, techniques like ECBS that can extract high-quality subsets from noisy sources will likely become increasingly valuable.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

New!Foster Adaptivity and Balance in Learning with Noisy Labels

Mengmeng Sheng, Zeren Sun, Tao Chen, Shuchao Pang, Yucheng Wang, Yazhou Yao

Label noise is ubiquitous in real-world scenarios, posing a practical challenge to supervised models due to its effect in hurting the generalization performance of deep neural networks. Existing methods primarily employ the sample selection paradigm and usually rely on dataset-dependent prior knowledge (eg, a pre-defined threshold) to cope with label noise, inevitably degrading the adaptivity. Moreover, existing methods tend to neglect the class balance in selecting samples, leading to biased model performance. To this end, we propose a simple yet effective approach named textbf{SED} to deal with label noise in a textbf{S}elf-adaptivtextbf{E} and class-balancetextbf{D} manner. Specifically, we first design a novel sample selection strategy to empower self-adaptivity and class balance when identifying clean and noisy data. A mean-teacher model is then employed to correct labels of noisy samples. Subsequently, we propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples. Finally, we additionally employ consistency regularization on selected clean samples to improve model generalization performance. Extensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method. The source code has been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/SED.

7/4/2024

cs.CV cs.LG

Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning

Rui Zhao, Bin Shi, Jianfei Ruan, Tianze Pan, Bo Dong

In noisy label learning, estimating noisy class posteriors plays a fundamental role for developing consistent classifiers, as it forms the basis for estimating clean class posteriors and the transition matrix. Existing methods typically learn noisy class posteriors by training a classification model with noisy labels. However, when labels are incorrect, these models may be misled to overemphasize the feature parts that do not reflect the instance characteristics, resulting in significant errors in estimating noisy class posteriors. To address this issue, this paper proposes to augment the supervised information with part-level labels, encouraging the model to focus on and integrate richer information from various parts. Specifically, our method first partitions features into distinct parts by cropping instances, yielding part-level labels associated with these various parts. Subsequently, we introduce a novel single-to-multiple transition matrix to model the relationship between the noisy and part-level labels, which incorporates part-level labels into a classifier-consistent framework. Utilizing this framework with part-level labels, we can learn the noisy class posteriors more precisely by guiding the model to integrate information from various parts, ultimately improving the classification performance. Our method is theoretically sound, while experiments show that it is empirically effective in synthetic and real-world noisy benchmarks.

5/10/2024

cs.CV cs.LG

Latent-based Diffusion Model for Long-tailed Recognition

Pengxiao Han, Changkun Ye, Jieming Zhou, Jing Zhang, Jie Hong, Xuesong Li

Long-tailed imbalance distribution is a common issue in practical computer vision applications. Previous works proposed methods to address this problem, which can be categorized into several classes: re-sampling, re-weighting, transfer learning, and feature augmentation. In recent years, diffusion models have shown an impressive generation ability in many sub-problems of deep computer vision. However, its powerful generation has not been explored in long-tailed problems. We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), as a feature augmentation method to tackle the issue. First, we encode the imbalanced dataset into features using the baseline model. Then, we train a Denoising Diffusion Implicit Model (DDIM) using these encoded features to generate pseudo-features. Finally, we train the classifier using the encoded and pseudo-features from the previous two steps. The model's accuracy shows an improvement on the CIFAR-LT and ImageNet-LT datasets by using the proposed method.

4/24/2024

cs.CV cs.AI

❗

Long-Tailed Anomaly Detection with Learnable Class Names

Chih-Hui Ho, Kuan-Chuan Peng, Nuno Vasconcelos

Anomaly detection (AD) aims to identify defective images and localize their defects (if any). Ideally, AD models should be able to detect defects over many image classes; without relying on hard-coded class names that can be uninformative or inconsistent across datasets; learn without anomaly supervision; and be robust to the long-tailed distributions of real-world applications. To address these challenges, we formulate the problem of long-tailed AD by introducing several datasets with different levels of class imbalance and metrics for performance evaluation. We then propose a novel method, LTAD, to detect defects from multiple and long-tailed classes, without relying on dataset class names. LTAD combines AD by reconstruction and semantic AD modules. AD by reconstruction is implemented with a transformer-based reconstruction module. Semantic AD is implemented with a binary classifier, which relies on learned pseudo class names and a pretrained foundation model. These modules are learned over two phases. Phase 1 learns the pseudo-class names and a variational autoencoder (VAE) for feature synthesis that augments the training data to combat long-tails. Phase 2 then learns the parameters of the reconstruction and classification modules of LTAD. Extensive experiments using the proposed long-tailed datasets show that LTAD substantially outperforms the state-of-the-art methods for most forms of dataset imbalance. The long-tailed dataset split is available at https://zenodo.org/records/10854201 .

4/1/2024

cs.CV