Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation

arXiv:2406.13283

Published 6/21/2024 by Björn Nieth, Thomas Altstidl, Leo Schwinn, Björn Eskofier

Abstract

Their vulnerability to small, imperceptible attacks limits the adoption of deep learning models to real-world systems. Adversarial training has proven to be one of the most promising strategies against these attacks, at the expense of a substantial increase in training time. With the ongoing trend of integrating large-scale synthetic data this is only expected to increase even further. Thus, the need for data-centric approaches that reduce the number of training samples while maintaining accuracy and robustness arises. While data pruning and active learning are prominent research topics in deep learning, they are as of now largely unexplored in the adversarial training literature. We address this gap and propose a new data pruning strategy based on extrapolating data importance scores from a small set of data to a larger set. In an empirical evaluation, we demonstrate that extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.

Overview

This paper proposes a data pruning method for large-scale adversarial training that reduces the number of training samples while maintaining accuracy and robustness.
The key idea is to compute data importance scores on a small set of data and extrapolate them to a larger set, rather than scoring every sample directly.
In an empirical evaluation, the authors show that this extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.

Plain English Explanation

When training machine learning models, the dataset used can have a significant impact on the model's performance. Some data points may be more important than others in helping the model learn effectively. The paper introduces a new method to identify and remove less important data points, a process known as data pruning.

The researchers' approach focuses on adversarial training, a technique that makes models more robust to attacks that try to fool them with small, imperceptible changes to the input. Their method, data importance extrapolation, computes importance scores on a small set of data and extrapolates them to a larger set, so that not every data point has to be scored directly.
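To make "small, imperceptible changes" concrete, here is a minimal sketch of one standard attack, the fast gradient sign method (FGSM), applied to a toy logistic-regression model. The source does not say which attack the paper uses; the model, the weights `w` and `b`, the input `x`, and the budget `eps` are all illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic model on one example (y in {0, 1})."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, b, x, y, eps):
    """Move each input coordinate by eps in the direction that increases the loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Toy model and input (assumed values, for illustration only)
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.1)
clean_loss = loss(w, b, x, y)
adv_loss = loss(w, b, x_adv, y)
print(clean_loss, adv_loss)  # the perturbed input has the higher loss
```

Adversarial training fits the model on perturbed inputs such as `x_adv` instead of, or alongside, `x`. Every training sample therefore incurs attack cost on top of the usual training cost, which is why shrinking the dataset matters here.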

By selectively removing the less important data points, the researchers show that they can reduce the size of the training set while maintaining the model's robustness to adversarial attacks and its accuracy on regular, non-attacked inputs. This is a valuable capability, because adversarial training takes substantially longer than standard training and the cost grows with the amount of data.

The researchers compare their method to other data pruning techniques and demonstrate its superiority in terms of improving model robustness. This work contributes to the ongoing efforts to make machine learning models more reliable and secure, which is crucial as these systems are increasingly deployed in real-world applications.

Technical Explanation

The paper presents a data pruning method for large-scale adversarial training built on data importance extrapolation. The key idea is to extrapolate importance scores from a small set of data to a larger set, rather than computing them directly for every sample.

The authors first introduce a framework for estimating the importance of data points based on their impact on the model's loss and gradients. They then propose an efficient extrapolation technique to quickly estimate the importance of all data points, without the need for expensive retraining.
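One plausible way to realise the extrapolation step is a nearest-neighbour average in some feature space. The sketch below is an assumption for illustration, not the authors' confirmed procedure: the function name `knn_extrapolate`, the choice of k-nearest neighbours, the Euclidean metric, and the toy features are all hypothetical.

```python
import math

def knn_extrapolate(scored_feats, scores, unscored_feats, k=3):
    """Give each unscored sample the mean score of its k nearest scored neighbours."""
    extrapolated = []
    for query in unscored_feats:
        # Distance from the query to every sample that already has a score
        by_distance = sorted(
            (math.dist(query, feat), score)
            for feat, score in zip(scored_feats, scores)
        )
        nearest = by_distance[:k]
        extrapolated.append(sum(score for _, score in nearest) / len(nearest))
    return extrapolated

# Small scored subset: two clusters with low and high importance (toy data)
scored_feats = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
scores = [0.1, 0.2, 0.3, 0.9, 0.8, 1.0]

# Larger pool whose scores were never computed directly
unscored_feats = [(0.2, 0.2), (5.5, 5.5)]

estimates = knn_extrapolate(scored_feats, scores, unscored_feats, k=3)
print(estimates)  # first sample inherits a low score, second a high one
```

The expensive scoring runs only on the small subset. Under this assumption, the larger pool needs one feature extraction and one neighbour lookup per sample.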

The pruning process involves iteratively removing the least important data points from the training set, while monitoring the model's performance on a held-out validation set. The authors demonstrate that their approach outperforms existing data pruning methods, such as "Robust Data Pruning: Uncovering and Overcoming Implicit Bias", "PUMA: Margin-Based Data Pruning", and "Boosting Model Resilience via Implicit Adversarial Data", in terms of improving model robustness while maintaining high clean accuracy.
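The pruning loop described above can be sketched as follows. The names `iterative_prune`, `evaluate`, `n_drop`, and `tolerance` are hypothetical. In practice `evaluate` would train on the kept samples and return a validation metric; the toy version here only stands in for that.

```python
def iterative_prune(scores, evaluate, n_drop=1, tolerance=0.01):
    """Drop the lowest-scoring samples round by round until validation quality degrades."""
    # Indices ordered from most to least important
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    baseline = evaluate(kept)
    while len(kept) > n_drop:
        candidate = kept[:-n_drop]  # remove the n_drop least important samples
        if evaluate(candidate) < baseline - tolerance:
            break  # pruning further would hurt the held-out metric
        kept = candidate
    return sorted(kept)

# Toy importance scores for six training samples
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]

def toy_evaluate(indices):
    """Stand-in for train-then-validate: quality holds only if samples 1, 3, 5 remain."""
    return 1.0 if {1, 3, 5} <= set(indices) else 0.5

kept_indices = iterative_prune(scores, toy_evaluate)
print(kept_indices)
```

Calling `evaluate` once per round is the expensive part of such a loop. A cheaper alternative is to fix a target keep fraction up front and prune in a single pass.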

The authors also conduct a comprehensive evaluation, including a study on dataset pruning for image super-resolution, to further validate the effectiveness of their approach.

Critical Analysis

The paper presents a promising approach for data pruning in the context of large-scale adversarial training. The authors' key contribution is the development of an efficient data importance extrapolation technique, which avoids the need for expensive retraining during the pruning process.

One potential limitation of the approach is that it relies on the assumption that the importance of a data point can be accurately estimated based on its influence on the model's predictions. In practice, this assumption may not always hold, especially for more complex models or datasets. The authors acknowledge this limitation and suggest that further research is needed to address it.
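One way to probe this assumption is to score a held-out subset directly and compare the ranking with the estimated scores for the same samples. The sketch below uses Spearman rank correlation. The helper names and the toy numbers are assumptions, and the simple ranking routine assumes there are no tied scores.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result

def spearman(a, b):
    """Spearman rank correlation between two equally long score lists without ties."""
    n = len(a)
    rank_a, rank_b = ranks(a), ranks(b)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Held-out samples: directly computed scores vs. estimated scores (toy data)
direct_scores = [0.1, 0.4, 0.35, 0.8, 0.6]
estimated_scores = [0.15, 0.3, 0.45, 0.7, 0.65]

rho = spearman(direct_scores, estimated_scores)
print(rho)
```

A correlation near 1 means the estimate preserves the ordering that pruning depends on. A low value suggests the estimated scores should not be trusted for selecting which samples to remove.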

Additionally, the paper does not explore the impact of the pruning method on the model's generalization to out-of-distribution data or its ability to learn from diverse datasets. These aspects could be important considerations for the broader applicability of the proposed approach.

Overall, the paper presents a valuable contribution to the field of adversarial training and data pruning. The authors' work demonstrates the potential for improving model robustness through selective data removal, and their findings could inspire further research in this direction. Readers are encouraged to critically evaluate the method and its limitations, and to consider how it might be extended or improved upon in future studies.

Conclusion

The paper introduces a data pruning method for large-scale adversarial training based on data importance extrapolation. The key innovation is an efficient way to extrapolate importance scores from a small set of data to a larger set, which allows the researchers to selectively remove less important data points without scoring every sample directly.

The authors show that their approach outperforms existing data pruning methods in terms of improving model robustness to adversarial attacks, while maintaining high clean accuracy. This work contributes to the ongoing efforts to make machine learning models more reliable and secure, which is crucial as these systems are increasingly deployed in real-world applications.

While the paper presents a promising solution, the authors acknowledge the need for further research to address potential limitations, such as the accuracy of the data importance estimation and the impact on model generalization. Nonetheless, this work represents an important step forward in the field of adversarial training and data pruning, and it is likely to inspire new ideas and avenues of exploration in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Large-scale Dataset Pruning with Dynamic Uncertainty

Muyang He, Shuo Yang, Tiejun Huang, Bo Zhao

The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As the outcome, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop. We propose a simple yet effective dataset pruning method by exploring both the prediction uncertainty and training dynamics. We study dataset pruning by measuring the variation of predictions during the whole training process on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves 25% lossless pruning ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.

6/17/2024

cs.LG cs.CV

Robust Data Pruning: Uncovering and Overcoming Implicit Bias

Artem Vysogorets, Kartik Ahuja, Julia Kempe

In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. At the same time, we argue that random data pruning with appropriate class ratios has potential to improve the worst-class performance. We propose a fairness-aware approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving robustness at a tolerable drop of average performance as we prune more from the datasets. We present theoretical analysis of the classification risk in a mixture of Gaussians to further motivate our algorithm and support our findings.

4/9/2024

cs.LG cs.CV

PUMA: margin-based data pruning

Javier Maroto, Pascal Frossard

Deep learning has been able to outperform humans in terms of classification accuracy in many tasks. However, to achieve robustness to adversarial perturbations, the best methodologies require to perform adversarial training on a much larger training set that has been typically augmented using generative models (e.g., diffusion models). Our main objective in this work, is to reduce these data requirements while achieving the same or better accuracy-robustness trade-offs. We focus on data pruning, where some training samples are removed based on the distance to the model classification boundary (i.e., margin). We find that the existing approaches that prune samples with low margin fails to increase robustness when we add a lot of synthetic data, and explain this situation with a perceptron learning task. Moreover, we find that pruning high margin samples for better accuracy increases the harmful impact of mislabeled perturbed data in adversarial training, hurting both robustness and accuracy. We thus propose PUMA, a new data pruning strategy that computes the margin using DeepFool, and prunes the training samples of highest margin without hurting performance by jointly adjusting the training attack norm on the samples of lowest margin. We show that PUMA can be used on top of the current state-of-the-art methodology in robustness, and it is able to significantly improve the model performance unlike the existing data pruning strategies. Not only PUMA achieves similar robustness with less data, but it also significantly increases the model accuracy, improving the performance trade-off.

5/13/2024

cs.LG

Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples

Vahid Jebraeeli, Bo Jiang, Hamid Krim, Derya Cansever

The challenge of limited availability of data for training in machine learning arises in many applications and the impact on performance and generalization is serious. Traditional data augmentation methods aim to enhance training with a moderately sufficient data set. Generative models like Generative Adversarial Networks (GANs) often face problematic convergence when generating significant and diverse data samples. Diffusion models, though effective, still struggle with high computational cost and long training times. This paper introduces an innovative Expansive Synthesis model that generates large-scale, high-fidelity datasets from minimal samples. The proposed approach exploits expander graph mappings and feature interpolation to synthesize expanded datasets while preserving the intrinsic data distribution and feature structural relationships. The rationale of the model is rooted in the non-linear property of neural networks' latent space and in its capture by a Koopman operator to yield a linear space of features to facilitate the construction of larger and enriched consistent datasets starting with a much smaller dataset. This process is optimized by an autoencoder architecture enhanced with self-attention layers and further refined for distributional consistency by optimal transport. We validate our Expansive Synthesis by training classifiers on the generated datasets and comparing their performance to classifiers trained on larger, original datasets. Experimental results demonstrate that classifiers trained on synthesized data achieve performance metrics on par with those trained on full-scale datasets, showcasing the model's potential to effectively augment training data. This work represents a significant advancement in data generation, offering a robust solution to data scarcity and paving the way for enhanced data availability in machine learning applications.

6/26/2024

cs.LG cs.CV eess.IV