Large-scale Dataset Pruning with Dynamic Uncertainty

2306.05175

Published 6/17/2024 by Muyang He, Shuo Yang, Tiejun Huang, Bo Zhao

Abstract

The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As the outcome, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop. We propose a simple yet effective dataset pruning method by exploring both the prediction uncertainty and training dynamics. We study dataset pruning by measuring the variation of predictions during the whole training process on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves 25% lossless pruning ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.

Overview

The researchers investigate how to prune large-scale datasets into informative subsets that can train sophisticated deep learning models with a negligible performance drop
They propose a simple yet effective dataset pruning method that draws on both prediction uncertainty and training dynamics
Extensive experiments on ImageNet-1K and ImageNet-21K, using advanced models such as Swin Transformer and ConvNeXt, show a 25% lossless pruning ratio

Plain English Explanation

The state-of-the-art performance in many machine learning tasks, such as image classification, has been driven by the use of larger and larger datasets and more complex deep learning models. However, this comes at a significant computational cost that can be prohibitive. This paper explores a way to prune large datasets to create a smaller, informative subset that can be used to train sophisticated deep learning models with negligible drop in performance.

The researchers propose a simple yet effective method that looks at two key factors: the uncertainty in the model's predictions and how those predictions change for each sample over the course of training. By analyzing these aspects, they can identify which data samples are most informative and which can be safely removed from the training set without hurting the model's performance.

The team tested their approach on the large ImageNet-1K and ImageNet-21K datasets, using advanced deep learning models like Swin Transformer and ConvNeXt. The results show they can prune the datasets by 25% without losing any performance, significantly reducing the computational cost required to train these models.

Technical Explanation

The researchers propose a dataset pruning method that combines prediction uncertainty with training dynamics. For prediction uncertainty, they measure how much the model's prediction for each data sample varies over the course of training. Samples whose predictions remain stable contribute little additional information and are the candidates for removal, while samples with high variation are kept.

To capture training dynamics, the team records the model's prediction for each sample throughout the whole training process rather than at a single checkpoint. Measuring the variation across the entire run, instead of from one snapshot, gives each sample a score that reflects how it behaved from the start of training to the end.
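The idea of scoring a sample by how much its prediction varies during training can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the true-class probability of each sample has been logged once per epoch, and the function name `dynamic_uncertainty`, the sliding-window formulation, and the window length of 3 are all assumptions made for the example.

```python
from statistics import stdev

def dynamic_uncertainty(probs, window=3):
    """Average sliding-window standard deviation of one sample's
    true-class probability across training epochs.

    probs  : list of floats, one per recorded epoch (hypothetical log)
    window : consecutive epochs per window (assumed value, not from the paper)
    """
    if window < 2 or len(probs) < window:
        raise ValueError("need window >= 2 and at least `window` epochs")
    # The standard deviation inside each window captures local fluctuation.
    window_stds = [
        stdev(probs[start:start + window])
        for start in range(len(probs) - window + 1)
    ]
    # Averaging over all windows summarizes the whole training run.
    return sum(window_stds) / len(window_stds)

# Toy logs: a sample learned early and stably vs. one that keeps flipping.
stable = [0.90, 0.95, 0.97, 0.98, 0.99, 0.99]
unstable = [0.20, 0.80, 0.30, 0.90, 0.40, 0.85]

score_stable = dynamic_uncertainty(stable)
score_unstable = dynamic_uncertainty(unstable)
```

In this toy example the fluctuating sample receives the larger score. In practice the per-epoch probabilities would come from a single training run on the full dataset, which is the up-front cost of computing any such score.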

The researchers evaluated their pruning method on the large-scale ImageNet-1K and ImageNet-21K datasets, using advanced models like Swin Transformer and ConvNeXt. Their experiments show that they can prune up to 25% of the data without any loss in model performance, significantly reducing the computational resources required.
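The selection step implied by a fixed pruning ratio can be sketched as a simple ranking over per-sample scores. The function `prune_by_score` and its parameters are hypothetical names introduced for illustration. The ranking direction is exposed as a flag because it is an assumption that should be checked against the original paper and code.

```python
def prune_by_score(scores, prune_ratio=0.25, drop_lowest=True):
    """Return the sorted indices of the samples to keep.

    scores      : list of per-sample scores (e.g. an uncertainty score)
    prune_ratio : fraction of the dataset to remove; 0.25 mirrors the
                  25% figure reported in the summary
    drop_lowest : ranking direction; an assumption to verify against
                  the original method
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_drop = int(len(scores) * prune_ratio)
    # Rank indices so that the samples to drop come first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i],
                    reverse=not drop_lowest)
    return sorted(ranked[n_drop:])

toy_scores = [0.02, 0.31, 0.27, 0.01, 0.15, 0.40, 0.05, 0.22]
kept = prune_by_score(toy_scores)  # drops the 2 lowest-scoring of 8 samples
```

The returned indices can then be used to build a subset of the training set. The model trained on that subset is what would be compared against the full-data baseline.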

Critical Analysis

The paper provides a compelling and practical approach to dataset pruning, which is an important challenge as deep learning models become increasingly large and computationally expensive to train. The proposed method is relatively simple to implement and the results demonstrate its effectiveness on large-scale datasets and advanced architectures.

However, the paper does not explore the potential biases or fairness implications of dataset pruning. It's possible that the pruning process could disproportionately remove certain types of data samples, which could lead to biased models. Additionally, the paper does not consider the impact of dataset pruning on model generalization or robustness.
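One inexpensive way to check for the class-imbalance concern raised above is to compare how much of each class survives pruning. The sketch below is illustrative only: the label list, the kept indices, and the helper name `class_retention` are hypothetical and do not come from the paper.

```python
from collections import Counter

def class_retention(labels, kept_indices):
    """Fraction of each class that survives pruning."""
    total = Counter(labels)
    kept = Counter(labels[i] for i in kept_indices)
    # Counter returns 0 for classes that were removed entirely.
    return {cls: kept[cls] / total[cls] for cls in total}

labels = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog"]
kept_indices = [0, 1, 2, 3, 4, 5]  # hypothetical pruning result (25% removed)
rates = class_retention(labels, kept_indices)
```

Here the overall pruning ratio is 25%, yet every removed sample comes from one class. A large spread in per-class retention rates like this would be a signal to inspect the pruned subset before training on it.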

Further research could investigate these important aspects, as well as extending the pruning approach to other domains beyond computer vision. It would also be valuable to understand how the pruning method performs compared to other dataset reduction techniques, such as active learning or data selection.

Conclusion

This paper presents a simple yet effective method for pruning large-scale datasets, which can significantly reduce the computational cost of training sophisticated deep learning models with negligible performance drop. The approach leverages both prediction uncertainty and training dynamics to identify the most informative data samples, allowing for up to 25% of the dataset to be pruned without any loss in accuracy.

While the paper does not address potential biases or broader implications of dataset pruning, the proposed technique offers a promising path forward for making large-scale deep learning more efficient and accessible. As the field continues to push the boundaries of model complexity and dataset size, tools like this will become increasingly important for balancing performance and computational constraints.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Robust Data Pruning: Uncovering and Overcoming Implicit Bias

Artem Vysogorets, Kartik Ahuja, Julia Kempe

In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. At the same time, we argue that random data pruning with appropriate class ratios has potential to improve the worst-class performance. We propose a fairness-aware approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving robustness at a tolerable drop of average performance as we prune more from the datasets. We present theoretical analysis of the classification risk in a mixture of Gaussians to further motivate our algorithm and support our findings.

4/9/2024

cs.LG cs.CV

A Study in Dataset Pruning for Image Super-Resolution

Brian B. Moser, Federico Raue, Andreas Dengel

In image Super-Resolution (SR), relying on large datasets for training is a double-edged sword. While offering rich training material, they also demand substantial computational and storage resources. In this work, we analyze dataset pruning to solve these challenges. We introduce a novel approach that reduces a dataset to a core-set of training samples, selected based on their loss values as determined by a simple pre-trained SR model. By focusing the training on just 50% of the original dataset, specifically on the samples characterized by the highest loss values, we achieve results comparable to or surpassing those obtained from training on the entire dataset. Interestingly, our analysis reveals that the top 5% of samples with the highest loss values negatively affect the training process. Excluding these samples and adjusting the selection to favor easier samples further enhances training outcomes. Our work opens new perspectives to the untapped potential of dataset pruning in image SR. It suggests that careful selection of training data based on loss-value metrics can lead to better SR models, challenging the conventional wisdom that more data inevitably leads to better performance.

6/11/2024

eess.IV cs.AI cs.CV cs.GR cs.LG

Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation

Bjorn Nieth, Thomas Altstidl, Leo Schwinn, Bjorn Eskofier

Their vulnerability to small, imperceptible attacks limits the adoption of deep learning models to real-world systems. Adversarial training has proven to be one of the most promising strategies against these attacks, at the expense of a substantial increase in training time. With the ongoing trend of integrating large-scale synthetic data this is only expected to increase even further. Thus, the need for data-centric approaches that reduce the number of training samples while maintaining accuracy and robustness arises. While data pruning and active learning are prominent research topics in deep learning, they are as of now largely unexplored in the adversarial training literature. We address this gap and propose a new data pruning strategy based on extrapolating data importance scores from a small set of data to a larger set. In an empirical evaluation, we demonstrate that extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.

6/21/2024

cs.LG

Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning

Xin Zhang, Jiawei Du, Yunsong Li, Weiying Xie, Joey Tianyi Zhou

Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.

5/29/2024

cs.CV cs.LG