SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching

Read original: arXiv:2406.18561 - Published 6/28/2024 by Yongmin Lee, Hye Won Chung

Overview

SelMatch proposes a method for scaling up dataset distillation, which aims to generate a small synthetic dataset that trains a model to perform nearly as well as one trained on the full original dataset.
The key innovations are selection-based initialization and partial updates by trajectory matching, which improve the efficiency and scalability of dataset distillation.
The method is evaluated on CIFAR-10/100 and TinyImageNet, where the authors report that it consistently outperforms leading selection-only and distillation-only methods at subset ratios from 5% to 30%.

Plain English Explanation

Dataset distillation is a technique for creating a small synthetic dataset that can train a model to perform almost as well as one trained on the full original dataset. This is useful when the original dataset is very large or difficult to work with.

SelMatch is a new method that improves on previous dataset distillation approaches. It has two main innovations:

Selection-based initialization: Instead of starting the synthetic dataset from arbitrary images, SelMatch starts from a subset of real training images chosen so that their difficulty suits the target dataset size. This gives the synthetic dataset a "head start" and keeps harder, rarer examples represented as the size grows.
Partial updates by trajectory matching: Rather than optimizing every synthetic image, SelMatch updates only part of the synthetic dataset. The updated part is optimized by trajectory matching: a model trained on the synthetic data should follow a similar path through parameter space (its "trajectory") as a model trained on the original data. Leaving the remaining images unchanged helps preserve the difficulty level chosen at initialization.
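
To make "matching the trajectory" more concrete, here is a minimal sketch of the normalized parameter-distance loss commonly used by trajectory-matching methods. It assumes model parameters are flattened into plain lists of floats. The function and argument names are illustrative, and whether SelMatch uses exactly this form is an assumption rather than something stated in this summary.

```python
def trajectory_matching_loss(student_end, expert_start, expert_end):
    """Normalized squared distance between student and expert parameters.

    student_end:  parameters after a few training steps on the synthetic data,
                  starting from expert_start.
    expert_start: expert parameters at the start of the matched segment.
    expert_end:   expert parameters (trained on real data) at the segment's end.
    """
    gap = sum((s - e) ** 2 for s, e in zip(student_end, expert_end))
    expert_travel = sum((a - e) ** 2 for a, e in zip(expert_start, expert_end))
    if expert_travel == 0:
        raise ValueError("expert parameters did not move over this segment")
    return gap / expert_travel


# Toy example with two parameters.
loss = trajectory_matching_loss(
    student_end=[1.0, 2.0], expert_start=[0.0, 0.0], expert_end=[1.0, 3.0]
)
print(loss)  # 0.1
```

A loss of 0 means the model trained on synthetic data ended up exactly where the expert did. Dividing by how far the expert moved keeps the loss comparable across early and late segments of training.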

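These innovations allow SelMatch to scale dataset distillation to larger synthetic datasets, where the authors report that it outperforms previous methods. This could be useful wherever training on the full dataset is too costly or the data cannot be shared freely, for example image distillation for safe data sharing in histopathology. Related distillation ideas also appear in generative modeling, such as improved distribution matching for fast image synthesis and multi-step distillation of diffusion models, although those distill models rather than datasets.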

Technical Explanation

SelMatch introduces two key innovations to improve the efficiency and scalability of dataset distillation:

Selection-based initialization: Instead of starting from an arbitrary initialization, SelMatch initializes the synthetic dataset with a subset of the original training data selected to match a desired difficulty level, which is tailored to the number of images per class (IPC). The motivation, according to the abstract, is that existing trajectory-matching methods struggle to incorporate the complex, rare features of harder samples even as IPC increases, leaving a coverage gap between easy and hard test samples. Seeding the synthetic dataset with appropriately difficult real samples addresses this gap from the start.
Partial updates by trajectory matching: Rather than optimizing the entire synthetic dataset, SelMatch updates only part of it through trajectory matching, which optimizes synthetic images so that a model trained on them follows a parameter trajectory close to that of a model trained on the original data. The remaining samples are left unchanged, which helps maintain the intended difficulty level of the synthetic dataset.
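
The two ideas can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' algorithm: it assumes a precomputed difficulty score per sample, picks a contiguous window from the easy-to-hard ranking, and freezes the harder half of that window. The names `select_initial_subset`, `split_for_partial_update`, `partial_update`, `start_fraction`, and `distill_fraction` are hypothetical.

```python
def select_initial_subset(difficulty, ipc, start_fraction):
    """Pick `ipc` real samples of one class from a window over the difficulty ranking.

    difficulty:     hypothetical precomputed score per sample (higher = harder).
    start_fraction: where the window starts in the easy-to-hard ordering
                    (0.0 starts at the easiest sample).
    """
    if ipc > len(difficulty):
        raise ValueError("ipc exceeds the number of available samples")
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    start = min(int(start_fraction * len(order)), len(order) - ipc)
    return order[start:start + ipc]


def split_for_partial_update(selected, distill_fraction):
    """Split the selected samples into a trainable part and a frozen part.

    Assumption for this sketch: the easier samples (front of `selected`)
    are the ones that get updated; the rest stay as real images.
    """
    n_trainable = int(distill_fraction * len(selected))
    return selected[:n_trainable], selected[n_trainable:]


def partial_update(synthetic, grads, trainable, lr):
    """Take one gradient step on the trainable samples only."""
    trainable = set(trainable)
    return {
        idx: ([p - lr * g for p, g in zip(pixels, grads[idx])]
              if idx in trainable else list(pixels))
        for idx, pixels in synthetic.items()
    }


# Toy class with six samples, two "pixels" each.
difficulty = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
selected = select_initial_subset(difficulty, ipc=4, start_fraction=0.25)
trainable, frozen = split_for_partial_update(selected, distill_fraction=0.5)

synthetic = {idx: [1.0, 1.0] for idx in selected}
grads = {idx: [1.0, 2.0] for idx in selected}  # stand-in for matching-loss gradients
updated = partial_update(synthetic, grads, trainable, lr=0.5)

print(selected)   # [5, 3, 2, 4]
print(trainable)  # [5, 3]
print(updated[5], updated[2])  # [0.5, 0.0] [1.0, 1.0]
```

In a real pipeline the gradients would come from backpropagating a trajectory-matching loss through a few training steps on the synthetic data. The point of the sketch is only that frozen samples never receive an update.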

The paper evaluates SelMatch on image classification benchmarks: CIFAR-10, CIFAR-100, and TinyImageNet. According to the abstract, SelMatch consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30% of the original dataset. This is the regime where many earlier distillation methods lose effectiveness and can even underperform random sample selection.
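
The abstract reports results by subset ratio (5% to 30%), while distillation papers usually speak in images per class (IPC). The short calculation below converts between the two. It assumes the subset ratio is a fraction of the training set spread evenly over classes, and uses the standard per-class training set sizes of each benchmark. The helper name `images_per_class` is illustrative, and the 10% row is included only as an intermediate point.

```python
# Training images per class in the standard releases of each dataset.
TRAIN_IMAGES_PER_CLASS = {"CIFAR-10": 5000, "CIFAR-100": 500, "TinyImageNet": 500}


def images_per_class(dataset, subset_percent):
    """Synthetic images per class (IPC) for a given subset ratio in percent."""
    return TRAIN_IMAGES_PER_CLASS[dataset] * subset_percent // 100


for name in TRAIN_IMAGES_PER_CLASS:
    print(name, [images_per_class(name, pct) for pct in (5, 10, 30)])
# CIFAR-10 [250, 500, 1500]
# CIFAR-100 [25, 50, 150]
# TinyImageNet [25, 50, 150]
```

These IPC values are far larger than the 1 to 50 images per class where distillation methods are typically strongest, which is the scaling regime the paper targets.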

Critical Analysis

The SelMatch paper makes a compelling case for its effectiveness in scaling up dataset distillation. The key innovations of selection-based initialization and partial updates by trajectory matching appear to be well-designed and grounded in sound principles.

That said, the paper does not address some potential limitations and areas for further research:

Generalization to other domains: The evaluation is primarily focused on image classification tasks. It would be valuable to see how SelMatch performs on other types of data, such as text, audio, or tabular datasets, to assess its broader applicability.
Computational complexity: Although SelMatch is designed to scale to larger synthetic datasets, the paper does not provide a detailed analysis of its computational complexity. As dataset sizes and model complexities continue to grow, the cost of the approach may become an important consideration.
Robustness to noisy or adversarial data: The paper does not explore how SelMatch might perform in the presence of noisy or adversarial data in the original dataset. This is an important consideration for real-world applications, where data quality can be a concern.
Interpretability and explainability: The paper does not provide much insight into why the selection-based initialization and partial updates are effective. A deeper understanding of the underlying mechanisms could lead to further improvements or inform the application of these techniques to other problems.

Despite these potential areas for further research, the SelMatch paper represents a significant advancement in the field of dataset distillation and could have important implications for a wide range of data-efficient learning and data sharing applications.

Conclusion

SelMatch is a dataset distillation method that introduces two key innovations, selection-based initialization and partial updates by trajectory matching, to improve the scalability of generating small synthetic datasets that can train models nearly as effectively as the full original dataset.

The paper demonstrates strong performance across various image classification benchmarks, outperforming previous dataset distillation approaches. While there are some potential limitations and areas for further research, SelMatch represents a significant advancement in the field and could have important implications for a wide range of applications that require data-efficient learning or secure data sharing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching

Yongmin Lee, Hye Won Chung

Dataset distillation aims to synthesize a small number of images per class (IPC) from a large dataset to approximate full dataset training with minimal performance loss. While effective in very small IPC ranges, many distillation methods become less effective, even underperforming random sample selection, as IPC increases. Our examination of state-of-the-art trajectory-matching based distillation methods across various IPC scales reveals that these methods struggle to incorporate the complex, rare features of harder samples into the synthetic dataset even with the increased IPC, resulting in a persistent coverage gap between easy and hard test samples. Motivated by such observations, we introduce SelMatch, a novel distillation method that effectively scales with IPC. SelMatch uses selection-based initialization and partial updates through trajectory matching to manage the synthetic dataset's desired difficulty level tailored to IPC scales. When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30%.

6/28/2024

Data-Efficient Generation for Dataset Distillation

Zhe Li, Weitong Zhang, Sarah Cechnicka, Bernhard Kainz

While deep learning techniques have proven successful in image-related tasks, the exponentially increased data storage and computation costs become a significant challenge. Dataset distillation addresses these challenges by synthesizing only a few images for each class that encapsulate all essential information. Most current methods focus on matching. The problems lie in the synthetic images not being human-readable and the dataset performance being insufficient for downstream learning tasks. Moreover, the distillation time can quickly get out of bounds when the number of synthetic images per class increases even slightly. To address this, we train a class conditional latent diffusion model capable of generating realistic synthetic images with labels. The sampling time can be reduced to several tens of images per second. We demonstrate that models can be effectively trained using only a small set of synthetic images and evaluated on a large real test set. Our approach achieved rank (1) in The First Dataset Distillation Challenge at ECCV 2024 on the CIFAR100 and TinyImageNet datasets.

9/9/2024

Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation

Shaobo Wang, Yantai Yang, Qilong Wang, Kaixin Li, Linfeng Zhang, Junchi Yan

Dataset Distillation (DD) aims to synthesize a small dataset capable of performing comparably to the original dataset. Despite the success of numerous DD methods, theoretical exploration of this area remains unaddressed. In this paper, we take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty. We begin by empirically examining sample difficulty, measured by gradient norm, and observe that different matching-based methods roughly correspond to specific difficulty tendencies. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. Our findings suggest that prioritizing the synthesis of easier samples from the original dataset can enhance the quality of distilled datasets, especially in low IPC (image-per-class) settings. Based on our empirical observations and theoretical analysis, we introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality. Our SDC can be seamlessly integrated into existing methods as a plugin with minimal code adjustments. Experimental results demonstrate that adding SDC generates higher-quality distilled datasets across 7 distillation methods and 6 datasets.

8/23/2024

Dataset Distillation by Automatic Training Trajectories

Dai Liu, Jindong Gu, Hu Cao, Carsten Trinitis, Martin Schulz

Dataset Distillation is used to create a concise, yet informative, synthetic dataset that can replace the original dataset for training purposes. Some leading methods in this domain prioritize long-range matching, involving the unrolling of training trajectories with a fixed number of steps (NS) on the synthetic dataset to align with various expert training trajectories. However, traditional long-range matching methods possess an overfitting-like problem: the fixed step size NS forces the synthetic dataset to conform, in a distorted way, to the expert training trajectories it has seen, resulting in a loss of generality, especially with respect to unencountered architectures. We refer to this as the Accumulated Mismatching Problem (AMP), and propose a new approach, Automatic Training Trajectories (ATT), which dynamically and adaptively adjusts trajectory length NS to address the AMP. Our method outperforms existing methods particularly in tests involving cross-architectures. Moreover, owing to its adaptive nature, it exhibits enhanced stability in the face of parameter variations.

7/22/2024