Dataset Condensation for Time Series Classification via Dual Domain Matching

Read original: arXiv:2403.07245 - Published 6/11/2024 by Zhanyu Liu, Ke Hao, Guanjie Zheng, Yanwei Yu

Dataset Condensation for Time Series Classification via Dual Domain Matching

Overview

This paper proposes a dataset condensation method for time series classification tasks.
The method aims to condense the original dataset into a smaller set of synthetic samples that can effectively train a classification model.
It uses a dual domain matching approach to align the distribution of the synthetic samples with the original data in both the time and frequency domains.

Plain English Explanation

The paper presents a way to create a smaller, condensed dataset that can be used to train a model for classifying time series data. Time series data refers to data that changes over time, like stock prices or temperature readings. The idea is to generate a set of synthetic samples that capture the key characteristics of the original dataset, so you don't need to use all the original data to train the model.

The key innovation is using a "dual domain matching" approach. This means the synthetic samples are designed to match the original data in both the time domain (how the data changes over time) and the frequency domain (the different frequencies or patterns present in the data). By aligning the synthetic data with the original in these two ways, the model can be trained effectively on the condensed dataset.

This builds on prior work on dataset condensation, which aimed to create compact datasets for training computer vision models. Applying this idea to time series data is an interesting extension, as time series data has its own unique challenges compared to images.

Technical Explanation

The paper proposes a dataset condensation method for time series classification that uses a dual domain matching approach. The key steps are:

Preprocessing: The original time series data is preprocessed to extract features in both the time and frequency domains.
Synthetic Sample Generation: An optimization process is used to generate a set of synthetic time series samples that match the original data's distribution in both the time and frequency domains.
Model Training: The condensed dataset of synthetic samples is used to train a time series classification model.

The dual domain matching is implemented by defining loss functions that quantify the alignment between the original and synthetic data in the time and frequency domains. These losses are minimized during the synthetic sample generation process.

The paper evaluates the proposed method on several time series classification benchmarks and shows it can achieve comparable or better performance to using the full original dataset, while only requiring a fraction of the training data.

Critical Analysis

The paper provides a novel approach to dataset condensation for time series classification, which is an important problem as collecting and storing large amounts of time series data can be costly and difficult.

One limitation mentioned in the paper is that the method requires access to the full original dataset during the synthetic sample generation process. This may not always be feasible in real-world applications where the full dataset is not available.

Additionally, the paper does not explore how the proposed method would scale to very large or high-dimensional time series datasets. The experiments are conducted on relatively small benchmark datasets, and further research may be needed to understand the method's performance on more complex real-world time series data.

Approaches like "Calibrated Dataset Condensation" that can enable faster hyperparameter search may also be worth investigating in the context of time series classification tasks.

Another potential area for future research is exploring multi-size dataset condensation techniques that can generate synthetic samples at different scales, which may be beneficial for training models on time series data with varying temporal resolutions.

Conclusion

This paper presents a novel dataset condensation method for time series classification tasks that uses a dual domain matching approach to generate synthetic samples that effectively capture the characteristics of the original data. The method shows promise in reducing the amount of training data required while maintaining model performance, which could have significant practical implications for time series analysis and modeling. However, further research is needed to address the limitations and explore additional extensions of the technique.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Dataset Condensation for Time Series Classification via Dual Domain Matching

Zhanyu Liu, Ke Hao, Guanjie Zheng, Yanwei Yu

Time series data has been demonstrated to be crucial in various research fields. The management of large quantities of time series data presents challenges in terms of deep learning tasks, particularly for training a deep neural network. Recently, a technique named textit{Dataset Condensation} has emerged as a solution to this problem. This technique generates a smaller synthetic dataset that has comparable performance to the full real dataset in downstream tasks such as classification. However, previous methods are primarily designed for image and graph datasets, and directly adapting them to the time series dataset leads to suboptimal performance due to their inability to effectively leverage the rich information inherent in time series data, particularly in the frequency domain. In this paper, we propose a novel framework named Dataset textit{textbf{Cond}}ensation for textit{textbf{T}}ime textit{textbf{S}}eries textit{textbf{C}}lassification via Dual Domain Matching (textbf{CondTSC}) which focuses on the time series classification dataset condensation task. Different from previous methods, our proposed framework aims to generate a condensed dataset that matches the surrogate objectives in both the time and frequency domains. Specifically, CondTSC incorporates multi-view data augmentation, dual domain training, and dual surrogate objectives to enhance the dataset condensation process in the time and frequency domains. Through extensive experiments, we demonstrate the effectiveness of our proposed framework, which outperforms other baselines and learns a condensed synthetic dataset that exhibits desirable characteristics such as conforming to the distribution of the original data.

6/11/2024

CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting

Jianrong Ding, Zhanyu Liu, Guanjie Zheng, Haiming Jin, Linghe Kong

Dataset condensation is a newborn technique that generates a small dataset that can be used in training deep neural networks to lower training costs. The objective of dataset condensation is to ensure that the model trained with the synthetic dataset can perform comparably to the model trained with full datasets. However, existing methods predominantly concentrate on classification tasks, posing challenges in their adaptation to time series forecasting (TS-forecasting). This challenge arises from disparities in the evaluation of synthetic data. In classification, the synthetic data is considered well-distilled if the model trained with the full dataset and the model trained with the synthetic dataset yield identical labels for the same input, regardless of variations in output logits distribution. Conversely, in TS-forecasting, the effectiveness of synthetic data distillation is determined by the distance between predictions of the two models. The synthetic data is deemed well-distilled only when all data points within the predictions are similar. Consequently, TS-forecasting has a more rigorous evaluation methodology compared to classification. To mitigate this gap, we theoretically analyze the optimization objective of dataset condensation for TS-forecasting and propose a new one-line plugin of dataset condensation designated as Dataset Condensation for Time Series Forecasting (CondTSF) based on our analysis. Plugging CondTSF into previous dataset condensation methods facilitates a reduction in the distance between the predictions of the model trained with the full dataset and the model trained with the synthetic dataset, thereby enhancing performance. We conduct extensive experiments on eight commonly used time series datasets. CondTSF consistently improves the performance of all previous dataset condensation methods across all datasets, particularly at low condensing ratios.

6/12/2024

Elucidating the Design Space of Dataset Condensation

Shitong Shao, Zikai Zhou, Huanran Chen, Zhiqiang Shen

Dataset condensation, a concept within data-centric learning, efficiently transfers critical attributes from an original dataset to a synthetic version, maintaining both diversity and realism. This approach significantly improves model training efficiency and is adaptable across multiple application areas. Previous methods in dataset condensation have faced challenges: some incur high computational costs which limit scalability to larger datasets (e.g., MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially in smaller datasets (e.g., SRe2L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design framework that includes specific, effective strategies like implementing soft category-aware matching and adjusting the learning rate schedule. These strategies are grounded in empirical evidence and theoretical backing. Our resulting approach, Elucidate Dataset Condensation (EDC), establishes a benchmark for both small and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%. This performance exceeds those of SRe2L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively.

5/7/2024

DANCE: Dual-View Distribution Alignment for Dataset Condensation

Hansong Zhang, Shikun Li, Fanzhao Lin, Weiping Wang, Zhenxing Qian, Shiming Ge

Dataset condensation addresses the problem of data burden by learning a small synthetic training set that preserves essential knowledge from the larger real training set. To date, the state-of-the-art (SOTA) results are often yielded by optimization-oriented methods, but their inefficiency hinders their application to realistic datasets. On the other hand, the Distribution-Matching (DM) methods show remarkable efficiency but sub-optimal results compared to optimization-oriented methods. In this paper, we reveal the limitations of current DM-based methods from the inner-class and inter-class views, i.e., Persistent Training and Distribution Shift. To address these problems, we propose a new DM-based method named Dual-view distribution AligNment for dataset CondEnsation (DANCE), which exploits a few pre-trained models to improve DM from both inner-class and inter-class views. Specifically, from the inner-class view, we construct multiple middle encoders to perform pseudo long-term distribution alignment, making the condensed set a good proxy of the real one during the whole training process; while from the inter-class view, we use the expert models to perform distribution calibration, ensuring the synthetic data remains in the real class region during condensing. Experiments demonstrate the proposed method achieves a SOTA performance while maintaining comparable efficiency with the original DM across various scenarios. Source codes are available at https://github.com/Hansong-Zhang/DANCE.

6/4/2024