Calibrated Dataset Condensation for Faster Hyperparameter Search

2405.17535

Published 5/29/2024 by Mucong Ding, Yuancheng Xu, Tahseen Rabbani, Xiaoyu Liu, Brian Gravelle, Teresa Ranadive, Tai-Ching Tuan, Furong Huang

cs.LG cs.AI stat.ML

Abstract

Dataset condensation can be used to reduce the computational cost of training multiple models on a large dataset by condensing the training dataset into a small synthetic set. State-of-the-art approaches rely on matching the model gradients between the real and synthetic data. However, there is no theoretical guarantee of the generalizability of the condensed data: data condensation often generalizes poorly across hyperparameters/architectures in practice. This paper considers a different condensation objective specifically geared toward hyperparameter search. We aim to generate a synthetic validation dataset so that the validation-performance rankings of the models, with different hyperparameters, on the condensed and original datasets are comparable. We propose a novel hyperparameter-calibrated dataset condensation (HCDC) algorithm, which obtains the synthetic validation dataset by matching the hyperparameter gradients computed via implicit differentiation and efficient inverse Hessian approximation. Experiments demonstrate that the proposed framework effectively maintains the validation-performance rankings of models and speeds up hyperparameter/architecture search for tasks on both images and graphs.

Overview

This paper proposes hyperparameter-calibrated dataset condensation (HCDC), a method for speeding up hyperparameter search for machine learning models.
Hyperparameter tuning is a crucial but time-consuming step, because each candidate configuration normally requires training on the full dataset. The authors aim to cut this cost substantially.
The method generates a small synthetic validation dataset on which models with different hyperparameters rank in nearly the same order as they do on the original data.

Plain English Explanation

The process of training a machine learning model often requires carefully selecting the right "hyperparameters" - settings that control how the model learns from data. Finding the optimal hyperparameters can take a long time, as it involves training the model repeatedly on the full dataset to see how different settings perform.

To speed up this hyperparameter search, the researchers developed a method called hyperparameter-calibrated dataset condensation (HCDC). <a href="https://aimodels.fyi/papers/arxiv/elucidating-design-space-dataset-condensation">This builds on previous work on dataset condensation</a>, which aims to create a much smaller "synthetic" dataset that can substitute for the original training data.

The key innovation in this paper is "calibration" - constructing the synthetic data so that the ranking of hyperparameter configurations by validation performance stays comparable to the ranking on the full data. This allows the search to run quickly on the small synthetic dataset while still pointing to configurations that do well on the original data.

Technical Explanation

HCDC starts from the standard dataset condensation setting, which learns a small set of synthetic data points to stand in for the full training dataset. State-of-the-art condensation methods do this by matching model gradients between real and synthetic data. <a href="https://aimodels.fyi/papers/arxiv/multisize-dataset-condensation">Related work on multi-size dataset condensation</a> produces synthetic datasets of several sizes from a single condensation process.

However, gradient-matching condensation carries no theoretical guarantee of generalization, and in practice condensed data often generalizes poorly across hyperparameters and architectures. To address this, the authors change the objective: rather than approximating the training data as a whole, they calibrate a synthetic validation dataset so that it preserves how models with different hyperparameters rank against each other.

Specifically, HCDC matches hyperparameter gradients, meaning the gradients of the validation loss with respect to the hyperparameters, between the synthetic and original data. These gradients are computed via implicit differentiation with an efficient approximation of the inverse Hessian. The synthetic validation points are then updated to reduce the mismatch, so that the condensed dataset serves as a reliable proxy for comparing hyperparameters.
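To make the idea of a hyperparameter gradient concrete, here is a minimal sketch on a toy problem, not the authors' implementation. It assumes ridge regression with a single hyperparameter `lam` (the regularization strength), and it uses a truncated Neumann series as the inverse Hessian approximation; the summary does not say which approximation the paper uses, so that choice is an assumption. All function names and the random data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit; also returns the training-loss Hessian w.r.t. w."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)
    w = np.linalg.solve(H, X.T @ y / n)
    return w, H

def val_loss(w, Xv, yv):
    """Mean squared validation error (with a 1/2 factor)."""
    return 0.5 * np.mean((Xv @ w - yv) ** 2)

def neumann_inverse_hvp(H, v, steps=500):
    """Approximate H^{-1} v by the truncated series alpha * sum_k (I - alpha*H)^k v."""
    alpha = 1.0 / np.linalg.norm(H, 2)  # 1 / largest eigenvalue, so the series converges
    term = v.copy()
    total = v.copy()
    for _ in range(steps):
        term = term - alpha * (H @ term)  # term <- (I - alpha*H) @ term
        total = total + term
    return alpha * total

def hypergradient(X, y, Xv, yv, lam):
    """d(validation loss)/d(lam) via the implicit function theorem."""
    w, H = fit_ridge(X, y, lam)
    g_val = Xv.T @ (Xv @ w - yv) / len(yv)  # gradient of validation loss w.r.t. w
    # At the optimum, H @ dw/dlam + w = 0, so dw/dlam = -H^{-1} w.
    return float(-w @ neumann_inverse_hvp(H, g_val))

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X, Xv = rng.normal(size=(50, 5)), rng.normal(size=(30, 5))
y = X @ w_true + 0.1 * rng.normal(size=50)
yv = Xv @ w_true + 0.1 * rng.normal(size=30)
hg = hypergradient(X, y, Xv, yv, lam=0.1)
```

In this toy setting the hypergradient can be checked against a finite-difference estimate obtained by refitting at `lam ± eps`. A condensation method along the lines described above would compute such a quantity on both the synthetic and the original validation data and adjust the synthetic points until the two agree.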

The experiments show that HCDC effectively maintains the validation-performance rankings of models and speeds up hyperparameter and architecture search on both image and graph tasks. <a href="https://aimodels.fyi/papers/arxiv/gcondenser-benchmarking-graph-condensation">Work on graph condensation</a> pursues similar data-efficiency gains for graph data.
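One way to check whether a condensed dataset preserves validation-performance rankings is a rank correlation between the scores obtained on the two datasets. The sketch below uses Spearman correlation, which is an assumption here rather than a metric confirmed by this summary, and the accuracy values are made up purely for illustration.

```python
import numpy as np

def ranks(values):
    """Return 0-based ranks of the values (ties are not treated specially)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical validation accuracies for five hyperparameter configurations.
acc_full = [0.71, 0.80, 0.76, 0.68, 0.83]       # evaluated on the original data
acc_condensed = [0.52, 0.61, 0.58, 0.50, 0.66]  # evaluated on the condensed data
rho = spearman(acc_full, acc_condensed)
```

The absolute accuracies on the condensed data are lower in this made-up example, but the configurations appear in the same order, so the correlation is 1. For choosing among configurations, the order is what matters.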

Critical Analysis

The key innovation of this work is the calibration step, which addresses a major limitation of prior dataset condensation methods - their poor generalization across hyperparameters and architectures. With calibration, the authors report faster hyperparameter search while the validation-performance rankings of models are maintained.

That said, the calibration process adds some computational overhead, so there may be a tradeoff between the speed of the hyperparameter search and the time required for the calibration step. The authors note that the calibration can be done in parallel, which helps mitigate this issue.

Additionally, the effectiveness of the method likely depends on the complexity of the hyperparameter space and the dataset. Highly complex or high-dimensional hyperparameter spaces may be more challenging to accurately calibrate. <a href="https://aimodels.fyi/papers/arxiv/dataset-condensation-driven-machine-unlearning">Further research could explore combining this with other dataset condensation techniques</a> to handle a broader range of scenarios.

Overall, this work represents an important step forward in accelerating the critical hyperparameter tuning process for machine learning models. The calibrated condensation approach provides a promising path to make hyperparameter search much more efficient.

Conclusion

HCDC offers a way to substantially speed up hyperparameter tuning for machine learning models. By creating a small synthetic validation dataset that preserves the validation-performance rankings observed on the original data, the authors enable hyperparameter search to run much more quickly.

This has significant practical implications, as hyperparameter tuning is often a major bottleneck in developing high-performing machine learning systems. <a href="https://aimodels.fyi/papers/arxiv/koopcon-new-approach-towards-smarter-less-complex">By making this process more efficient, the Calibrated Dataset Condensation method could help drive wider adoption of advanced machine learning techniques in real-world applications.</a>

Overall, this research represents an important contribution to the field of machine learning, offering a novel solution to a longstanding challenge. As machine learning models become increasingly complex, tools like this that improve the efficiency of model development will only grow in importance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Elucidating the Design Space of Dataset Condensation

Shitong Shao, Zikai Zhou, Huanran Chen, Zhiqiang Shen

Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version, maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous methods in dataset condensation have faced challenges: some incur high computational costs which limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially in smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies like implementing soft category-aware matching and adjusting the learning rate schedule. These strategies are grounded in empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance exceeds those of SRe2L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.

5/7/2024

cs.LG cs.AI cs.CV

Multisize Dataset Condensation

Yang He, Lingao Xiao, Joey Tianyi Zhou, Ivor Tsang

While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the subset degradation problem in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an adaptive subset loss on top of the basic condensation loss to mitigate the subset degradation problem. Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class. Code is available at: https://github.com/he-y/Multisize-Dataset-Condensation.

4/16/2024

cs.CV

Ameliorate Spurious Correlations in Dataset Condensation

Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh

Dataset Condensation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias inside the original dataset on the performance of dataset condensation. With a comprehensive empirical evaluation on canonical datasets with color, corruption and background biases, we found that color and background biases in the original dataset will be amplified through the condensation process, resulting in a notable decline in the performance of models trained on the condensed dataset, while corruption bias is suppressed through the condensation process. To reduce bias amplification in dataset condensation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% from vanilla DM, boosting the performance by 67.7%, whereas applying state-of-the-art debiasing method on the same dataset only achieves 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset condensation and provide a promising avenue to address bias amplification in the process.

6/12/2024

cs.LG cs.AI cs.CV

Dataset Condensation for Time Series Classification via Dual Domain Matching

Zhanyu Liu, Ke Hao, Guanjie Zheng, Yanwei Yu

Time series data has been demonstrated to be crucial in various research fields. The management of large quantities of time series data presents challenges in terms of deep learning tasks, particularly for training a deep neural network. Recently, a technique named Dataset Condensation has emerged as a solution to this problem. This technique generates a smaller synthetic dataset that has comparable performance to the full real dataset in downstream tasks such as classification. However, previous methods are primarily designed for image and graph datasets, and directly adapting them to the time series dataset leads to suboptimal performance due to their inability to effectively leverage the rich information inherent in time series data, particularly in the frequency domain. In this paper, we propose a novel framework named Dataset Condensation for Time Series Classification via Dual Domain Matching (CondTSC) which focuses on the time series classification dataset condensation task. Different from previous methods, our proposed framework aims to generate a condensed dataset that matches the surrogate objectives in both the time and frequency domains. Specifically, CondTSC incorporates multi-view data augmentation, dual domain training, and dual surrogate objectives to enhance the dataset condensation process in the time and frequency domains. Through extensive experiments, we demonstrate the effectiveness of our proposed framework, which outperforms other baselines and learns a condensed synthetic dataset that exhibits desirable characteristics such as conforming to the distribution of the original data.

6/11/2024

cs.LG