DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation

Read original: arXiv:2403.13322 - Published 5/28/2024 by Yifan Wu, Jiawei Du, Ping Liu, Yuewei Lin, Wenqing Cheng, Wei Xu

DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation

Overview

This paper introduces DD-RobustBench, a new adversarial robustness benchmark for dataset distillation methods.
Dataset distillation aims to compress a large dataset into a small set of synthetic data points that can be used to train models nearly as well as the original dataset.
The authors evaluate the adversarial robustness of several dataset distillation methods and propose new techniques to improve robustness.

Plain English Explanation

The paper discusses a new benchmark called DD-RobustBench that is used to test the adversarial robustness of different dataset distillation methods. Dataset distillation is a way to take a large dataset and create a much smaller dataset that can be used to train machine learning models almost as well as the original. This is useful for situations where the original dataset is too big to work with easily.

The authors evaluate several existing dataset distillation methods to see how robust they are to adversarial attacks - when someone tries to trick the model by making small changes to the input data. They also propose some new techniques that can improve the adversarial robustness of these distilled datasets. The goal is to create dataset distillation methods that can produce small, compressed datasets that are still highly effective and secure against malicious attacks.

Technical Explanation

The paper first reviews related work on dataset distillation and adversarial robustness. It then introduces the DD-RobustBench benchmark, which consists of several adversarial attacks applied to various dataset distillation methods on standard image classification datasets.

The authors evaluate the adversarial robustness of several existing dataset distillation approaches, including Curriculum Dataset Distillation, Self-Supervised Dataset Distillation, and Generative Dataset Distillation. They find that while these methods can achieve good performance on clean data, they are often vulnerable to adversarial attacks.

To address this, the authors propose new techniques for adversarial training and data augmentation to improve the robustness of distilled datasets. They show that these methods can significantly boost the adversarial robustness of the distilled datasets while maintaining high performance on clean data.

Critical Analysis

The authors provide a thorough evaluation of the adversarial robustness of several dataset distillation methods, which is an important and underexplored area of research. By introducing the DD-RobustBench benchmark, they have created a valuable tool for the community to assess the security of these distillation techniques.

One potential limitation is that the paper only considers image classification tasks, and the findings may not generalize to other domains. Additionally, the proposed robustness-enhancing techniques, while effective, may introduce additional complexity or computational overhead that could limit their practical applicability.

Further research could explore the transferability of the distilled datasets to different model architectures or tasks, as well as investigate the scalability and efficiency of the robustness-focused distillation methods. Incorporating feedback from real-world deployment scenarios could also help identify additional challenges and refine the techniques.

Conclusion

This paper presents an important contribution to the field of dataset distillation by introducing a new benchmark for evaluating the adversarial robustness of these techniques. The authors' findings highlight the vulnerability of existing distillation methods to adversarial attacks and propose novel approaches to improve robustness. As dataset distillation becomes increasingly important for efficient machine learning, ensuring the security and reliability of the distilled datasets will be crucial. The DD-RobustBench and the authors' work towards more robust distillation methods represent a significant step in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation

Yifan Wu, Jiawei Du, Ping Liu, Yuewei Lin, Wenqing Cheng, Wei Xu

Dataset distillation is an advanced technique aimed at compressing datasets into significantly smaller counterparts, while preserving formidable training performance. Significant efforts have been devoted to promote evaluation accuracy under limited compression ratio while overlooked the robustness of distilled dataset. In this work, we introduce a comprehensive benchmark that, to the best of our knowledge, is the most extensive to date for evaluating the adversarial robustness of distilled datasets in a unified way. Our benchmark significantly expands upon prior efforts by incorporating a wider range of dataset distillation methods, including the latest advancements such as TESLA and SRe2L, a diverse array of adversarial attack methods, and evaluations across a broader and more extensive collection of datasets such as ImageNet-1K. Moreover, we assessed the robustness of these distilled datasets against representative adversarial attack algorithms like PGD and AutoAttack, while exploring their resilience from a frequency perspective. We also discovered that incorporating distilled data into the training batches of the original dataset can yield to improvement of robustness.

5/28/2024

💬

Towards Trustworthy Dataset Distillation

Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang

Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data, we further propose to corrupt InD samples to generate pseudo-outliers, namely Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and POE surpasses the state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to open-world scenarios. Our code is available at https://github.com/mashijie1028/TrustDD

8/13/2024

Curriculum Dataset Distillation

Zhiheng Ma, Anjia Cao, Funing Yang, Xing Wei

Most dataset distillation methods struggle to accommodate large-scale datasets due to their substantial computational and memory requirements. In this paper, we present a curriculum-based dataset distillation framework designed to harmonize scalability with efficiency. This framework strategically distills synthetic images, adhering to a curriculum that transitions from simple to complex. By incorporating curriculum evaluation, we address the issue of previous methods generating images that tend to be homogeneous and simplistic, doing so at a manageable computational cost. Furthermore, we introduce adversarial optimization towards synthetic images to further improve their representativeness and safeguard against their overfitting to the neural network involved in distilling. This enhances the generalization capability of the distilled images across various neural network architectures and also increases their robustness to noise. Extensive experiments demonstrate that our framework sets new benchmarks in large-scale dataset distillation, achieving substantial improvements of 11.1% on Tiny-ImageNet, 9.0% on ImageNet-1K, and 7.3% on ImageNet-21K. The source code will be released to the community.

5/16/2024

What is Dataset Distillation Learning?

William Yang, Ye Zhu, Zhiwei Deng, Olga Russakovsky

Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide an framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding on how they can be effectively utilized.

7/23/2024