Dataset Distillation by Automatic Training Trajectories

Read original: arXiv:2407.14245 - Published 7/22/2024 by Dai Liu, Jindong Gu, Hu Cao, Carsten Trinitis, Martin Schulz


Overview

Dataset Distillation is a technique for compressing a large dataset into a much smaller set of synthetic datapoints that can still train machine learning models effectively.
This paper proposes Automatic Training Trajectories (ATT), a method that builds on trajectory matching: the synthetic data is optimized so that training on it follows the training trajectories of expert models trained on the real data.
The key idea is to adjust the length of the trajectory unrolled on the synthetic data dynamically, rather than fixing the number of steps in advance.

Plain English Explanation

The goal of Dataset Distillation is to take a large dataset and compress it into a much smaller set of synthetic datapoints. These synthetic points can then be used to train a machine learning model in place of the original dataset, ideally with little loss in accuracy.
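
To give a sense of scale, here is a small illustrative calculation. It is not taken from the summary: it assumes the CIFAR-10 benchmark (10 classes, 50,000 training images) and the images-per-class budgets commonly used in dataset distillation work. The helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(num_classes, images_per_class, full_size):
    """Fraction of the original dataset kept by the distilled set."""
    return num_classes * images_per_class / full_size


# Assumed setting: CIFAR-10 has 10 classes and 50,000 training images.
for ipc in (1, 10, 50):
    ratio = compression_ratio(10, ipc, 50_000)
    print(f"{ipc:>2} images per class -> {ratio:.2%} of the training set")
```

Even the largest budget in this example keeps only one image in a hundred, which is why the choice of synthetic datapoints matters so much.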

This paper introduces a new approach, Automatic Training Trajectories (ATT), for producing these synthetic datapoints. Earlier methods of this kind train a model on the synthetic data for a fixed number of steps and adjust the data so that the result lines up with the path an expert model took when trained on the real data. The authors argue that fixing the number of steps forces the synthetic data to fit the particular expert paths it has seen, which hurts generality, especially on model architectures not seen during distillation. ATT instead adjusts the number of steps adaptively.

This is like distilling a complex wine or spirit - you take the essential elements that contain the core flavor and concentrate them down into a smaller, more efficient form. Similarly, Dataset Distillation extracts the core information from a large dataset and packs it into a much smaller set of synthetic points.

Technical Explanation

The core of this paper is ATT, a trajectory-matching method for Dataset Distillation that chooses the length of the matched trajectory automatically instead of treating it as a fixed hyperparameter.

The key steps are:

1. Train expert models on the full dataset and record their training trajectories, that is, how the model's weights change over the course of training.
2. Optimize a small set of synthetic datapoints so that a model trained on them for a number of steps follows a trajectory similar to the expert's. Earlier long-range matching methods fix this number of steps (NS); ATT adjusts it dynamically.
3. Use the final synthetic dataset to train new models in place of the original full dataset, with far fewer datapoints.
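
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a toy linear-regression model, the normalized parameter distance commonly used in trajectory-matching work, and a simple selection rule (keep the unrolled step whose weights land closest to the expert target) as one plausible reading of adjusting the trajectory length adaptively. The function names, the toy data, and the selection rule are all hypothetical. The sketch only evaluates the matching loss; a real distillation loop would also backpropagate that loss into the synthetic data.

```python
def matching_loss(student, expert_start, expert_target):
    """Squared distance to the expert target, normalized by the expert's own move."""
    num = sum((s - t) ** 2 for s, t in zip(student, expert_target))
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    return num / den


def sgd_step(weights, data, lr):
    """One full-batch gradient step for least-squares linear regression."""
    grad = [0.0] * len(weights)
    for x, y in data:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(data)
    return [w - lr * g for w, g in zip(weights, grad)]


def unroll_and_select(expert_start, expert_target, synthetic, lr, max_steps):
    """Unroll up to max_steps on the synthetic set and keep the best-matching step.

    Assumption: "adaptive trajectory length" is modeled here as picking the
    step with the smallest matching loss instead of always using max_steps.
    """
    student = list(expert_start)
    best_step = 0
    best_loss = matching_loss(student, expert_start, expert_target)
    for step in range(1, max_steps + 1):
        student = sgd_step(student, synthetic, lr)
        loss = matching_loss(student, expert_start, expert_target)
        if loss < best_loss:
            best_step, best_loss = step, loss
    return best_step, best_loss


# Toy values (hypothetical): two expert checkpoints and two synthetic points.
expert_start = [0.0, 0.0]
expert_target = [1.0, -0.5]
synthetic = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]

best_step, best_loss = unroll_and_select(
    expert_start, expert_target, synthetic, lr=0.1, max_steps=20
)
print(best_step, round(best_loss, 4))
```

With these toy values the student passes closest to the expert target partway through the unroll and then drifts away, so always using all 20 steps would report a much larger mismatch than the adaptively chosen step.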

The authors demonstrate this approach on image classification tasks. They report that ATT outperforms existing methods, particularly in cross-architecture tests, and that its adaptive nature makes it more stable under parameter variations.

Critical Analysis

One key limitation mentioned in the paper is that the optimization process to find the synthetic datapoints can be computationally expensive, especially as the number of synthetic points increases. The authors suggest exploring more efficient optimization techniques as an area for future work.

Additionally, the paper only evaluates this method on image classification tasks. It would be valuable to see how well it generalizes to other domains, such as language or tabular data.

Overall, this is an interesting and promising approach to the important problem of dataset compression. Further research is needed to improve the efficiency and broaden the applicability of this technique.

Conclusion

This paper introduces ATT, a Dataset Distillation method that produces a small set of synthetic datapoints intended to replace the original full dataset for training machine learning models.

The key insight is that the length of the trajectory matched on the synthetic data should adapt during distillation rather than stay fixed. The authors report that this improves generalization to unseen architectures and stability under parameter variations.

If this approach can be scaled and generalized further, it could have significant implications for reducing the data and computational requirements of modern machine learning, with benefits for efficiency, cost, and environmental impact.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Dataset Distillation by Automatic Training Trajectories

Dai Liu, Jindong Gu, Hu Cao, Carsten Trinitis, Martin Schulz

Dataset Distillation is used to create a concise, yet informative, synthetic dataset that can replace the original dataset for training purposes. Some leading methods in this domain prioritize long-range matching, involving the unrolling of training trajectories with a fixed number of steps (NS) on the synthetic dataset to align with various expert training trajectories. However, traditional long-range matching methods suffer from an overfitting-like problem: the fixed step size NS forces the synthetic dataset to conform, in a distorted way, to the expert training trajectories it has seen, resulting in a loss of generality, especially to those from unencountered architectures. We refer to this as the Accumulated Mismatching Problem (AMP), and propose a new approach, Automatic Training Trajectories (ATT), which dynamically and adaptively adjusts trajectory length NS to address the AMP. Our method outperforms existing methods particularly in tests involving cross-architectures. Moreover, owing to its adaptive nature, it exhibits enhanced stability in the face of parameter variations.

7/22/2024

Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory

Wenliang Zhong, Haoyu Tang, Qinghai Zheng, Mingzhu Xu, Yupeng Hu, Liqiang Nie

The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT leverages insights from the linearized dynamics of Neural Tangent Kernel methods to create a convex combination of expert trajectories, guiding the student network to converge rapidly and stably. This trajectory is not only easier to store, but also enables a continuous sampling strategy during distillation, ensuring thorough learning and fitting of the entire expert trajectory. Comprehensive experiments across three public datasets validate the superiority of MCT over traditional MTT methods.

7/1/2024

Distilling Long-tailed Datasets

Zhenghao Zhao, Haoxuan Wang, Yuzhang Shang, Kai Wang, Yan Yan

Dataset distillation (DD) aims to distill a small, information-rich dataset from a larger one for efficient neural network training. However, existing DD methods struggle with long-tailed datasets, which are prevalent in real-world scenarios. By investigating the reasons behind this unexpected result, we identified two main causes: 1) Expert networks trained on imbalanced data develop biased gradients, leading to the synthesis of similarly imbalanced distilled datasets. Parameter matching, a common technique in DD, involves aligning the learning parameters of the distilled dataset with that of the original dataset. However, in the context of long-tailed datasets, matching biased experts leads to inheriting the imbalance present in the original data, causing the distilled dataset to inadequately represent tail classes. 2) The experts trained on such datasets perform suboptimally on tail classes, resulting in misguided distillation supervision and poor-quality soft-label initialization. To address these issues, we propose a novel long-tailed dataset distillation method, Long-tailed Aware Dataset distillation (LAD). Specifically, we propose Weight Mismatch Avoidance to avoid directly matching the biased expert trajectories. It reduces the distance between the student and the biased expert trajectories and prevents the tail class bias from being distilled to the synthetic dataset. Moreover, we propose Adaptive Decoupled Matching, which jointly matches the decoupled backbone and classifier to improve the tail class performance and initialize reliable soft labels. This work pioneers the field of long-tailed dataset distillation (LTDD), marking the first effective effort to distill long-tailed datasets.

8/28/2024

TrACT: A Training Dynamics Aware Contrastive Learning Framework for Long-tail Trajectory Prediction

Junrui Zhang, Mozhgan Pourkeshavarz, Amir Rasouli

As a safety critical task, autonomous driving requires accurate predictions of road users' future trajectories for safe motion planning, particularly under challenging conditions. Yet, many recent deep learning methods suffer from a degraded performance on the challenging scenarios, mainly because these scenarios appear less frequently in the training data. To address such a long-tail issue, existing methods force challenging scenarios closer together in the feature space during training to trigger information sharing among them for more robust learning. These methods, however, primarily rely on the motion patterns to characterize scenarios, omitting more informative contextual information, such as interactions and scene layout. We argue that exploiting such information not only improves prediction accuracy but also scene compliance of the generated trajectories. In this paper, we propose to incorporate richer training dynamics information into a prototypical contrastive learning framework. More specifically, we propose a two-stage process. First, we generate rich contextual features using a baseline encoder-decoder framework. These features are split into clusters based on the model's output errors, using the training dynamics information, and a prototype is computed within each cluster. Second, we retrain the model using the prototypes in a contrastive learning framework. We conduct empirical evaluations of our approach using two large-scale naturalistic datasets and show that our method achieves state-of-the-art performance by improving accuracy and scene compliance on the long-tail samples. Furthermore, we perform experiments on a subset of the clusters to highlight the additional benefit of our approach in reducing training bias.

5/1/2024