Carefully Blending Adversarial Training and Purification Improves Adversarial Robustness

Read original: arXiv:2306.06081 - Published 5/24/2024 by Emanuele Ballarin, Alessio Ansuini, Luca Bortolussi

Overview

Proposes CARSO, a new adversarial defense mechanism for image classification
Combines adversarial training and adversarial purification to enhance model robustness
CARSO maps the internal representation of a potentially perturbed input to a distribution of tentative clean reconstructions
Multiple samples from this distribution are classified by the same adversarially-trained model, and the outputs are aggregated for a robust prediction
Experiments show CARSO can defend against strong adaptive white-box attacks, with a modest loss in clean accuracy

Plain English Explanation

CARSO is a new technique that aims to make image classification models more robust against adversarial attacks. Adversarial attacks are small, carefully crafted changes to an image that can trick a model into misclassifying it. CARSO combines two existing approaches - adversarial training and adversarial purification - to create a defense mechanism that is more effective than either approach alone.

The key idea behind CARSO is to take a potentially perturbed input (i.e., an image that has been adversarially attacked) and map its internal representation to a distribution of tentative "clean" reconstructions. Multiple samples from this distribution are then classified by the same adversarially-trained model, and the outputs are combined to produce a robust prediction.

The researchers evaluated CARSO using a well-established benchmark of strong adaptive attacks, across different image datasets. They found that CARSO could effectively defend against these powerful white-box attacks (where the attacker has full knowledge of the model's architecture and parameters) while only incurring a modest loss in clean accuracy (i.e., the model's performance on unperturbed images).

Technical Explanation

CARSO builds upon an adversarially-trained image classifier. It learns to map the internal representation of a potentially perturbed input onto a distribution of tentative clean reconstructions. This is done using a neural network that takes the classifier's internal features as input and outputs the parameters of a Gaussian distribution.

Multiple samples are then drawn from this distribution and classified by the same adversarially-trained model. The outputs of these classifications are aggregated (e.g., by majority vote) to produce the final robust prediction. This process of sampling and aggregating is inspired by stochastic defenses, which have been shown to improve robustness.
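The sample-then-aggregate step described above can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the authors' implementation: the "classifier" and "purifier" are random linear maps standing in for trained networks, the Gaussian sampling follows the description in this summary, and majority vote is used because it is the example aggregation named above. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 3
FEATURE_DIM = 8
IMAGE_DIM = 16

# Hypothetical stand-ins for trained weights (random, for illustration only).
W_feat = rng.normal(size=(IMAGE_DIM, FEATURE_DIM))   # classifier feature layer
W_cls = rng.normal(size=(FEATURE_DIM, NUM_CLASSES))  # classifier output layer
W_mean = rng.normal(size=(FEATURE_DIM, IMAGE_DIM))   # purifier: features -> mean image


def internal_representation(x):
    """Internal features of the (stand-in) adversarially-trained classifier."""
    return np.tanh(x @ W_feat)


def classify(x):
    """Predicted class label of the stand-in classifier for one input."""
    return int(np.argmax(internal_representation(x) @ W_cls))


def sample_reconstructions(features, num_samples, sigma=0.1):
    """Draw tentative clean reconstructions from a Gaussian whose mean
    depends on the classifier's internal features (fixed sigma assumed)."""
    mean = features @ W_mean
    noise = rng.normal(size=(num_samples, IMAGE_DIM))
    return mean + sigma * noise


def majority_vote(labels, num_classes):
    """Return the most frequent label (ties go to the smallest label)."""
    counts = np.bincount(labels, minlength=num_classes)
    return int(np.argmax(counts))


def robust_predict(x, num_samples=8):
    """Features -> sampled reconstructions -> same classifier -> aggregate."""
    features = internal_representation(x)
    candidates = sample_reconstructions(features, num_samples)
    labels = np.array([classify(c) for c in candidates])
    return majority_vote(labels, NUM_CLASSES)


if __name__ == "__main__":
    possibly_perturbed_input = rng.normal(size=IMAGE_DIM)
    print(robust_predict(possibly_perturbed_input))
```

Note that the same classifier is used twice, once to produce the internal representation and once to label each reconstruction, which is why the cost grows with the number of samples.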

The researchers evaluated CARSO using the well-established AutoAttack benchmark, which comprises a suite of strong adaptive white-box attacks. Across different image datasets (CIFAR-10, CIFAR-100, and TinyImageNet-200), CARSO demonstrated significant improvements in $\ell_\infty$ robust classification accuracy compared to the state-of-the-art, while only incurring a modest clean accuracy toll.
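For readers unfamiliar with the metric, the usual definition of $\ell_\infty$ robust accuracy can be written as below. The symbols are notation introduced here for illustration, not taken from the paper: $f$ is the classifier, $D$ the labelled test set, and $\epsilon$ the perturbation budget. An attack suite such as AutoAttack searches for a violating perturbation rather than checking all of them, so the accuracy it reports is an empirical upper bound on this quantity.

```latex
\mathrm{RobAcc}_{\epsilon}(f)
  = \frac{1}{|D|} \sum_{(x, y) \in D}
    \mathbb{1}\!\left[\, f(x + \delta) = y
      \ \text{for all } \delta \text{ with } \|\delta\|_\infty \le \epsilon \,\right],
\qquad
\|\delta\|_\infty = \max_i |\delta_i|
```

Clean accuracy is the special case $\epsilon = 0$, which is why a defense is typically reported with both numbers side by side.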

Critical Analysis

The researchers acknowledge that CARSO, like other adversarial defenses, may not be a panacea. They mention that the method could potentially be vulnerable to more advanced attacks that specifically target the reconstruction process or the aggregation of multiple classifications.

Additionally, the computational overhead of CARSO may be a concern, as it requires multiple forward passes through the classifier for each input. This could limit its practical applicability, especially for real-time or high-throughput applications.

It would be interesting to see how CARSO compares with, or combines with, adversarial detectors and other defense mechanisms that aim to identify and remove adversarial perturbations. Exploring the scaling laws governing the trade-off between clean and robust accuracy could also provide further insight into the fundamental limitations of CARSO and other adversarial defenses.

Conclusion

The CARSO method represents a promising new approach to enhancing the adversarial robustness of image classification models. By combining adversarial training and adversarial purification in a synergistic manner, the researchers have demonstrated significant improvements in robust accuracy against strong adaptive attacks, while maintaining a relatively small clean accuracy penalty.

While CARSO may not be a silver bullet for adversarial robustness, it contributes to the ongoing research efforts to make AI systems more secure and reliable. As the field of adversarial machine learning continues to evolve, techniques like CARSO will play an important role in pushing the boundaries of what is possible and informing the development of more robust and trustworthy AI applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Carefully Blending Adversarial Training and Purification Improves Adversarial Robustness

Emanuele Ballarin, Alessio Ansuini, Luca Bortolussi

In this work, we propose a novel adversarial defence mechanism for image classification - CARSO - blending the paradigms of adversarial training and adversarial purification in a synergistic robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map its internal representation associated with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from such distribution are classified by the same adversarially-trained model, and an aggregation of its outputs finally constitutes the robust prediction of interest. Experimental evaluation by a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. Paying a modest clean accuracy toll, our method improves by a significant margin the state-of-the-art for CIFAR-10, CIFAR-100, and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against AutoAttack. Code, and instructions to obtain pre-trained models are available at https://github.com/emaballarin/CARSO .

5/24/2024

Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing

Yatong Bai, Brendon G. Anderson, Aerin Kim, Somayeh Sojoudi

While prior research has proposed a plethora of methods that build neural classifiers robust against adversarial attacks, practitioners are still reluctant to adopt them due to their unacceptably severe clean accuracy penalties. This paper significantly alleviates this accuracy-robustness trade-off by mixing the output probabilities of a standard classifier and a robust classifier, where the standard network is optimized for clean accuracy and is not robust in general. We show that the robust base classifier's confidence difference for correct and incorrect examples is the key to this improvement. In addition to providing intuitions and empirical evidence, we theoretically certify the robustness of the mixed classifier under realistic assumptions. Furthermore, we adapt an adversarial input detector into a mixing network that adaptively adjusts the mixture of the two base models, further reducing the accuracy penalty of achieving robustness. The proposed flexible method, termed adaptive smoothing, can work in conjunction with existing or even future methods that improve clean accuracy, robustness, or adversary detection. Our empirical evaluation considers strong attack methods, including AutoAttack and adaptive attack. On the CIFAR-100 dataset, our method achieves an 85.21% clean accuracy while maintaining a 38.72% $\ell_\infty$-AutoAttacked ($\epsilon = 8/255$) accuracy, becoming the second most robust method on the RobustBench CIFAR-100 benchmark as of submission, while improving the clean accuracy by ten percentage points compared with all listed models. The code that implements our method is available at https://github.com/Bai-YT/AdaptiveSmoothing.

7/23/2024

ZeroPur: Succinct Training-Free Adversarial Purification

Xiuli Bi, Zonglin Yang, Bo Liu, Xiaodong Cun, Chi-Man Pun, Pietro Lio, Bin Xiao

Adversarial purification is a kind of defense technique that can defend various unseen adversarial attacks without modifying the victim classifier. Existing methods often depend on external generative models or cooperation between auxiliary functions and victim classifiers. However, retraining generative models, auxiliary functions, or victim classifiers relies on the domain of the fine-tuned dataset and is computation-consuming. In this work, we suppose that adversarial images are outliers of the natural image manifold and the purification process can be considered as returning them to this manifold. Following this assumption, we present a simple adversarial purification method without further training to purify adversarial images, called ZeroPur. ZeroPur contains two steps: given an adversarial example, Guided Shift obtains the shifted embedding of the adversarial example by the guidance of its blurred counterparts; after that, Adaptive Projection constructs a directional vector by this shifted embedding to provide momentum, projecting adversarial images onto the manifold adaptively. ZeroPur is independent of external models and requires no retraining of victim classifiers or auxiliary functions, relying solely on victim classifiers themselves to achieve purification. Extensive experiments on three datasets (CIFAR-10, CIFAR-100, and ImageNet-1K) using various classifier architectures (ResNet, WideResNet) demonstrate that our method achieves state-of-the-art robust performance. The code will be publicly available.

6/6/2024

Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness

Stanislav Fort, Balaji Lakshminarayanan

Adversarial examples pose a significant challenge to the robustness, reliability and alignment of deep neural networks. We propose a novel, easy-to-use approach to achieving high-quality representations that lead to adversarial robustness through the use of multi-resolution input representations and dynamic self-ensembling of intermediate layer predictions. We demonstrate that intermediate layer predictions exhibit inherent robustness to adversarial attacks crafted to fool the full classifier, and propose a robust aggregation mechanism based on Vickrey auction that we call CrossMax to dynamically ensemble them. By combining multi-resolution inputs and robust ensembling, we achieve significant adversarial robustness on CIFAR-10 and CIFAR-100 datasets without any adversarial training or extra data, reaching an adversarial accuracy of $\approx$72% (CIFAR-10) and $\approx$48% (CIFAR-100) on the RobustBench AutoAttack suite ($L_\infty = 8/255$) with a finetuned ImageNet-pretrained ResNet152. This represents a result comparable with the top three models on CIFAR-10 and a +5 % gain compared to the best current dedicated approach on CIFAR-100. Adding simple adversarial training on top, we get $\approx$78% on CIFAR-10 and $\approx$51% on CIFAR-100, improving SOTA by 5 % and 9 % respectively and seeing greater gains on the harder dataset. We validate our approach through extensive experiments and provide insights into the interplay between adversarial robustness, and the hierarchical nature of deep representations. We show that simple gradient-based attacks against our model lead to human-interpretable images of the target classes as well as interpretable image changes. As a byproduct, using our multi-resolution prior, we turn pre-trained classifiers and CLIP models into controllable image generators and develop successful transferable attacks on large vision language models.

8/13/2024