Corruptions of Supervised Learning Problems: Typology and Mitigations

Read original: arXiv:2307.08643 - Published 5/6/2024 by Laura Iacovissi, Nan Lu, Robert C. Williamson

Overview

This paper presents a comprehensive theory of corruption in machine learning from an information-theoretic perspective.
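It generalizes the definition of corruption beyond distributional shift to cover any modification of a learning problem, including changes in model class and loss function, although the analysis itself focuses on changes in probability distributions.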
The paper develops a framework for studying pairwise Markovian corruptions, analyzes the impact of corruption on learning tasks, and investigates mitigation strategies.

Plain English Explanation

The paper tackles the widespread problem of corruption in data collection for machine learning models. Despite extensive research, the existing literature has largely focused on specific settings and learning scenarios, lacking a unified view.

The researchers develop a general theory of corruption from an information-theoretic perspective, with Markov kernels as the foundational mathematical tool. A Markov kernel can be read as a rule that assigns to each clean outcome a probability distribution over corrupted outcomes. The definition of corruption is generalized beyond distributional shift to include changes in the model class and loss function as well.

First, the researchers construct a provably exhaustive framework for pairwise Markovian corruptions. This allows them to categorize corruption types by their input space, and it unifies prior work on specific corruption models under a consistent terminology.

Next, the paper systematically analyzes how corruption affects learning tasks by comparing the Bayes risk (a measure of the best possible performance) between clean and corrupted scenarios. This sheds light on the complexities that arise from joint and dependent corruptions on both labels and attributes.

Notably, the researchers find that while label corruptions only affect the loss function, attribute corruptions can also impact the hypothesis class (the set of possible models) that the learning algorithm can consider.

Finally, the paper investigates mitigation strategies for various corruption types. It expands on existing loss-correction results for label corruption and identifies the need to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements. Within this new setting, the researchers provide a negative result, showing the inability to perform loss correction for attribute and joint corruptions.

Technical Explanation

The paper begins by constructing a provably exhaustive framework for pairwise Markovian corruptions. This framework not only allows the researchers to study corruption types based on their input space, but also serves to unify prior works on specific corruption models and establish a consistent nomenclature.
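On finite spaces, a Markov kernel reduces to a row-stochastic table, which makes the idea easy to sketch. The Python below is a minimal illustration and not the paper's construction: the function names, the toy joint distribution, and the noise rates are all assumptions, and it covers only the simplest cases, in which the kernel acts on one variable and ignores the other.

```python
from itertools import product


def apply_label_kernel(joint, kernel):
    """Corrupt labels: P~(x, y~) = sum_y P(x, y) * kernel[y][y~].

    Assumes the corrupted label space equals the clean one (the kernel's keys).
    """
    xs = sorted({x for x, _ in joint})
    ys = sorted(kernel)
    return {
        (x, y_new): sum(joint.get((x, y), 0.0) * kernel[y][y_new] for y in ys)
        for x, y_new in product(xs, ys)
    }


def apply_attribute_kernel(joint, kernel):
    """Corrupt attributes: P~(x~, y) = sum_x P(x, y) * kernel[x][x~].

    Assumes the corrupted attribute space equals the clean one.
    """
    xs = sorted(kernel)
    ys = sorted({y for _, y in joint})
    return {
        (x_new, y): sum(joint.get((x, y), 0.0) * kernel[x][x_new] for x in xs)
        for x_new, y in product(xs, ys)
    }


# Hypothetical clean joint distribution over attribute x in {0, 1} and label y in {0, 1}.
clean = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Hypothetical kernels: each row is a distribution over corrupted outcomes.
label_noise = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}  # labels flip w.p. 0.2
attr_noise = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}   # attributes flip w.p. 0.3

label_corrupted = apply_label_kernel(clean, label_noise)
attr_corrupted = apply_attribute_kernel(clean, attr_noise)
```

Both outputs are again joint distributions over the same pairs, so they can be compared directly with the clean one. A dependent corruption would need a kernel indexed by the pair (x, y) rather than by a single variable, which this sketch does not attempt.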

The researchers then systematically analyze the consequences of corruption on learning tasks by comparing the Bayes risks in the clean and corrupted scenarios. This examination reveals the complexities arising from joint and dependent corruptions on both labels and attributes. Notably, while label corruptions only affect the loss function, attribute corruptions extend the influence beyond the loss to also affect the hypothesis class.
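To make the Bayes-risk comparison concrete, the following Python sketch computes the Bayes risk under 0-1 loss for a small discrete problem, before and after symmetric label noise. It is an illustration under stated assumptions and not the paper's analysis: the toy distribution, the 0.2 flip rate, the choice of 0-1 loss, and the function names are all hypothetical.

```python
def bayes_risk_01(joint):
    """Bayes risk under 0-1 loss: sum over x of P(x) minus max_y P(x, y)."""
    xs = {x for x, _ in joint}
    risk = 0.0
    for x in xs:
        # Joint probabilities P(x, y) for this fixed x, one entry per label.
        masses = [p for (x2, _), p in joint.items() if x2 == x]
        # The best prediction keeps the largest mass; the rest is error.
        risk += sum(masses) - max(masses)
    return risk


def flip_labels(joint, rate):
    """Symmetric binary label noise: each label flips with probability `rate`.

    Assumes labels are 0/1 and that `joint` lists every (x, y) pair.
    """
    return {
        (x, y): (1 - rate) * joint[(x, y)] + rate * joint[(x, 1 - y)]
        for (x, y) in joint
    }


# Hypothetical clean joint distribution over attribute x and label y.
clean = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
noisy = flip_labels(clean, 0.2)

clean_risk = bayes_risk_01(clean)
noisy_risk = bayes_risk_01(noisy)
```

In this toy case the clean Bayes risk is about 0.2 and the corrupted one about 0.32, so the label noise raises the best achievable error even though the optimal prediction for each attribute value is unchanged.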

Building upon these results, the paper investigates mitigation strategies for various corruption types. It expands the existing loss-correction results for label corruption and identifies the necessity to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements. Within this new setting, the researchers provide a negative result, showing the inability to perform loss correction for attribute and joint corruptions.
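For label corruption, the classical loss-correction idea can be sketched in a few lines: if the label-noise transition matrix is known and invertible, the loss can be reweighted so that its expectation over noisy labels equals the clean loss. The Python below shows this backward correction for the binary case only. It is a sketch of the standard technique and not the paper's generalized result; the transition matrix, the loss values, and the function names are assumptions.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("transition matrix is singular; loss cannot be corrected")
    return [[d / det, -b / det], [-c / det, a / det]]


def corrected_losses(losses, transition):
    """Backward correction for binary labels.

    losses[y]        : clean loss of a fixed prediction when the true label is y
    transition[y][k] : probability that clean label y is observed as noisy label k
    Returns l~ such that sum_k transition[y][k] * l~[k] == losses[y] for each y.
    """
    inv = invert_2x2(transition)
    return [sum(inv[k][y] * losses[y] for y in range(2)) for k in range(2)]


# Hypothetical asymmetric label noise and per-label losses for one prediction.
T = [[0.8, 0.2], [0.3, 0.7]]
clean_loss = [0.1, 2.3]

noisy_loss = corrected_losses(clean_loss, T)

# Averaging the corrected loss over the noisy label recovers the clean loss.
recovered = [sum(T[y][k] * noisy_loss[k] for k in range(2)) for y in range(2)]
```

Note that the corrected loss can be negative for some noisy labels; only its expectation matches the clean loss. The sketch relies on the noise acting on labels alone, which is the setting where loss correction is known to work; the negative result described above concerns attribute and joint corruptions.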

Critical Analysis

The paper takes a comprehensive and principled approach to corruption in machine learning, a critical issue in the field. By generalizing the definition of corruption and developing a unifying framework, the researchers have laid the groundwork for a more systematic study of the problem.

One potential limitation of the work is the focus on pairwise Markovian corruptions. While this provides a solid foundation, real-world corruption scenarios may involve more complex, higher-order dependencies that are not captured by this model. Additionally, the negative result for loss correction in the attribute and joint corruption cases suggests that more advanced mitigation strategies may be necessary.

Further research could explore the interplay between corruption and other aspects of machine learning, such as the impact of distribution shifts on model robustness or the role of Byzantine-robust optimization in defending against data poisoning. Techniques like tabular data contrastive learning or deconstructing context learning may also offer promising avenues for mitigating the effects of corruption.

Conclusion

This paper presents a significant advancement in the understanding and modeling of corruption in machine learning. By developing a comprehensive theoretical framework and analyzing the impact of corruption on learning tasks, the researchers have laid the groundwork for more robust and reliable machine learning systems.

The negative results on loss correction for attribute and joint corruptions suggest that new and more sophisticated mitigation strategies will be necessary to tackle the complex challenges posed by real-world data corruption. As the field continues to evolve, this work will serve as an important foundation for future research in this critical area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Corruptions of Supervised Learning Problems: Typology and Mitigations

Laura Iacovissi, Nan Lu, Robert C. Williamson

Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature on corruption predominantly focuses on specific settings and learning scenarios, lacking a unified view. There is still a limited understanding of how to effectively model and mitigate corruption in machine learning problems. In this work, we develop a general theory of corruption from an information-theoretic perspective - with Markov kernels as a foundational mathematical tool. We generalize the definition of corruption beyond the concept of distributional shift: corruption includes all modifications of a learning problem, including changes in model class and loss function. We will focus here on changes in probability distributions. First, we construct a provably exhaustive framework for pairwise Markovian corruptions. The framework not only allows us to study corruption types based on their input space, but also serves to unify prior works on specific corruption models and establish a consistent nomenclature. Second, we systematically analyze the consequences of corruption on learning tasks by comparing Bayes risks in the clean and corrupted scenarios. This examination sheds light on complexities arising from joint and dependent corruptions on both labels and attributes. Notably, while label corruptions affect only the loss function, more intricate cases involving attribute corruptions extend the influence beyond the loss to affect the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand the existing loss-correction results for label corruption, and identify the necessity to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements. Within the latter setting, we provide a negative result for loss correction in the attribute and the joint corruption case.

5/6/2024

A Survey on the Robustness of Computer Vision Models against Common Corruptions

Shunxin Wang, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio

The performance of computer vision models is susceptible to unexpected changes in input images caused by sensor errors or extreme imaging environments, known as common corruptions (e.g. noise, blur, illumination changes). These corruptions can significantly hinder the reliability of these models when deployed in real-world scenarios, yet they are often overlooked when testing model generalization and robustness. In this survey, we present a comprehensive overview of methods that improve the robustness of computer vision models against common corruptions. We categorize methods into three groups based on the model components and training methods they target: data augmentation, learning strategies, and network components. We release a unified benchmark framework (available at https://github.com/nis-research/CorruptionBenchCV) to compare robustness performance across several datasets, and we address the inconsistencies of evaluation practices in the literature. Our experimental analysis highlights the base corruption robustness of popular vision backbones, revealing that corruption robustness does not necessarily scale with model size and data size. Large models gain negligible robustness improvements, considering the increased computational requirements. To achieve generalizable and robust computer vision models, we foresee the need to develop new learning strategies that efficiently exploit limited data and mitigate unreliable learning behaviors.

9/17/2024

Slight Corruption in Pre-training Data Makes Better Diffusion Models

Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth of the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.

6/3/2024

Multigroup Robustness

Lunjia Hu, Charlotte Peale, Judy Hanwen Shen

To address the shortcomings of real-world datasets, robust learning algorithms have been designed to overcome arbitrary and indiscriminate data corruption. However, practical processes of gathering data may lead to patterns of data corruption that are localized to specific partitions of the training dataset. Motivated by critical applications where the learned model is deployed to make predictions about people from a rich collection of overlapping subpopulations, we initiate the study of multigroup robust algorithms whose robustness guarantees for each subpopulation only degrade with the amount of data corruption inside that subpopulation. When the data corruption is not distributed uniformly over subpopulations, our algorithms provide more meaningful robustness guarantees than standard guarantees that are oblivious to how the data corruption and the affected subpopulations are related. Our techniques establish a new connection between multigroup fairness and robustness.

5/2/2024