Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning

Read original: arXiv:2209.11964 - Published 4/1/2024 by Zhengwei Fang, Rui Wang, Tao Huang, Liping Jing

Overview

Developing strong adversarial examples is crucial for evaluating and improving the robustness of deep neural networks.
Existing adversarial attack methods often have limitations, such as being sensitive to minor image transformations or overfitting to the source model used to generate them.
This paper proposes a new approach called Multiple Asymptotically Normal Distribution Attacks (MultiANDA) that aims to address these limitations.

Plain English Explanation

Deep neural networks, the powerful AI models behind many of today's most advanced technologies, can be vulnerable to adversarial examples - slightly modified inputs that cause the models to make mistakes. Developing effective adversarial examples is important for testing the robustness of these models and finding ways to make them more secure.

However, the adversarial examples created by current methods often have issues. They may work well only against the specific model used to generate them, and small changes to the image can cause them to fail. This is because these methods rely on limited information, typically just a single input example and a handful of source models.

The new MultiANDA approach aims to overcome these limitations. Rather than generating a single adversarial example, it learns a probability distribution over the space of potential adversarial perturbations. This allows it to produce a diverse set of adversarial examples that reliably transfer to different neural network architectures, including ones that were not used as source models when crafting the attack. The key insight is that the statistical properties of the optimization process used to find adversarial perturbations can be leveraged to estimate this distribution.

Technical Explanation

MultiANDA approximates the posterior distribution over adversarial perturbations by exploiting the asymptotic normality of stochastic gradient ascent (SGA), the optimization algorithm used to find adversarial examples. Specifically, it uses an ensemble of neural networks as a proxy for Bayesian marginalization, allowing it to estimate a mixture of Gaussian distributions that characterizes the space of potential adversarial perturbations.
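The idea of estimating one Gaussian per source model from the optimization iterates can be illustrated with a toy sketch. This is not the authors' implementation: the concave toy objective, the `noisy_grad`, `fit_component`, and `fit_mixture` helpers, the diagonal (per-coordinate) Gaussian, and all hyperparameters are assumptions chosen only to keep the example small and dependent on nothing beyond the Python standard library.

```python
import random
import statistics

def noisy_grad(delta, peak, rng, noise=0.5):
    # Gradient of the toy objective -0.5 * ||delta - peak||^2, plus Gaussian
    # noise standing in for stochasticity (hypothetical setup, not the paper's loss).
    return [(p - d) + rng.gauss(0.0, noise) for d, p in zip(delta, peak)]

def fit_component(peak, rng, steps=2000, burn_in=500, lr=0.05):
    # Run stochastic gradient ascent and keep the iterates after burn-in.
    delta = [0.0] * len(peak)
    iterates = []
    for t in range(steps):
        grad = noisy_grad(delta, peak, rng)
        delta = [d + lr * g for d, g in zip(delta, grad)]
        if t >= burn_in:
            iterates.append(delta)
    # Summarize the iterates with a diagonal Gaussian (assumed simplification):
    # one mean and one standard deviation per coordinate.
    mean = [statistics.fmean(col) for col in zip(*iterates)]
    std = [statistics.stdev(col) for col in zip(*iterates)]
    return mean, std

def fit_mixture(peaks, seed=0):
    # One Gaussian component per toy "source model"; components are
    # implicitly equally weighted.
    rng = random.Random(seed)
    return [fit_component(peak, rng) for peak in peaks]

# Three toy "models" whose objectives peak at slightly different perturbations.
PEAKS = [[1.0, -1.0], [0.8, -1.2], [1.2, -0.9]]
MIXTURE = fit_mixture(PEAKS)
```

In this toy setting each component's mean lands near the local optimum of its objective, and its spread reflects how far the noisy iterates wander around that optimum, which is the kind of geometric information the summary describes.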

This approximated posterior describes the stationary distribution of the SGA iterates and so captures the geometric information around the local optimum found by SGA, enabling MultiANDA to draw an unlimited number of diverse adversarial perturbations for each input. In experiments on seven normally trained and seven defense models, the authors report that MultiANDA outperforms ten state-of-the-art black-box attack methods in the strength and transferability of the generated adversarial examples.
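Drawing many perturbations from a learned mixture can be sketched as follows. The mixture values, the equal component weights, the `sample_perturbations` helper, and the clipping to an L-infinity budget `EPS` are all illustrative assumptions rather than details taken from the paper.

```python
import random

def sample_perturbations(mixture, n, eps, seed=0):
    # Draw n perturbations from an equal-weight mixture of diagonal Gaussians,
    # then clip each coordinate to the budget [-eps, eps] (assumed constraint).
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        mean, std = rng.choice(mixture)  # pick one component uniformly
        draw = [rng.gauss(m, s) for m, s in zip(mean, std)]
        samples.append([max(-eps, min(eps, v)) for v in draw])
    return samples

# Hypothetical mixture: (mean, std) per component over a 3-pixel "image".
MIXTURE = [
    ([0.010, -0.010, 0.000], [0.005, 0.005, 0.005]),
    ([0.012, -0.008, 0.002], [0.005, 0.005, 0.005]),
]
EPS = 8 / 255  # hypothetical perturbation budget

samples = sample_perturbations(MIXTURE, n=5, eps=EPS)
```

Because sampling is cheap once the components are estimated, any number of distinct perturbations can be drawn for the same input, each of which would then be added to the clean input to form a candidate adversarial example.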

Critical Analysis

The paper provides a promising new approach for generating high-quality adversarial examples. By modeling the distribution of adversarial perturbations rather than just producing individual examples, MultiANDA is able to better explore the optimization space and generate examples that transfer well to unknown neural network architectures.

However, the authors acknowledge that the proposed method relies on several assumptions, such as the asymptotic normality of SGA and the effectiveness of deep ensembles as a proxy for Bayesian marginalization. While the empirical results are strong, further theoretical and experimental analysis may be needed to fully understand the limitations of these assumptions and the broader applicability of the approach.

Additionally, the paper does not address the potential negative societal impacts of powerful adversarial attacks, such as their use for malicious purposes. As this research advances the state of the art in this area, it will be important for future work to consider responsible development and deployment of these techniques.

Conclusion

This paper introduces a novel approach called MultiANDA that can generate a diverse set of highly transferable adversarial examples for evaluating and improving the robustness of deep neural networks. By modeling the distribution of adversarial perturbations rather than just producing individual examples, MultiANDA represents an important advance in the field of adversarial machine learning. While further research is needed to fully understand the strengths and limitations of the method, this work provides a valuable contribution to the ongoing efforts to make AI systems more secure and reliable.

Related Papers

Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning

Zhengwei Fang, Rui Wang, Tao Huang, Liping Jing

Strong adversarial examples are crucial for evaluating and enhancing the robustness of deep neural networks. However, the performance of popular attacks is usually sensitive, for instance, to minor image transformations, stemming from limited information -- typically only one input example, a handful of white-box source models, and undefined defense strategies. Hence, the crafted adversarial examples are prone to overfit the source model, which hampers their transferability to unknown architectures. In this paper, we propose an approach named Multiple Asymptotically Normal Distribution Attacks (MultiANDA) which explicitly characterizes adversarial perturbations from a learned distribution. Specifically, we approximate the posterior distribution over the perturbations by taking advantage of the asymptotic normality property of stochastic gradient ascent (SGA), then employ the deep ensemble strategy as an effective proxy for Bayesian marginalization in this process, aiming to estimate a mixture of Gaussians that facilitates a more thorough exploration of the potential optimization space. The approximated posterior essentially describes the stationary distribution of SGA iterations, which captures the geometric information around the local optimum. Thus, MultiANDA allows drawing an unlimited number of adversarial perturbations for each input and reliably maintains the transferability. Our proposed method outperforms ten state-of-the-art black-box attacks on deep learning models with or without defenses through extensive experiments on seven normally trained and seven defense models.

4/1/2024

Bag of Tricks to Boost Adversarial Transferability

Zeliang Zhang, Wei Yao, Xiaosen Wang

Deep neural networks are widely known to be vulnerable to adversarial examples. However, vanilla adversarial examples generated under the white-box setting often exhibit low transferability across different models. Since adversarial transferability poses more severe threats to practical applications, various approaches have been proposed for better transferability, including gradient-based, input transformation-based, and model-related attacks. In this work, we find that several tiny changes in the existing adversarial attacks can significantly affect the attack performance, e.g., the number of iterations and step size. Based on careful studies of existing adversarial attacks, we propose a bag of tricks to enhance adversarial transferability, including momentum initialization, scheduled step size, dual example, spectral-based input transformation, and several ensemble strategies. Extensive experiments on the ImageNet dataset validate the high effectiveness of our proposed tricks and show that combining them can further boost adversarial transferability. Our work provides practical insights and techniques to enhance adversarial transferability, and offers guidance to improve the attack performance in real-world applications through simple adjustments.

7/23/2024

Perturbation Towards Easy Samples Improves Targeted Adversarial Transferability

Junqi Gao, Biqing Qi, Yao Li, Zhichang Guo, Dong Li, Yuming Xing, Dazhi Zhang

The transferability of adversarial perturbations provides an effective shortcut for black-box attacks. Targeted perturbations have greater practicality but are more difficult to transfer between models. In this paper, we experimentally and theoretically demonstrate that neural networks trained on the same dataset have more consistent performance in High-Sample-Density-Regions (HSDR) of each class instead of low sample density regions. Therefore, in the targeted setting, adding perturbations towards HSDR of the target class is more effective in improving transferability. However, density estimation is challenging in high-dimensional scenarios. Further theoretical and experimental verification demonstrates that easy samples with low loss are more likely to be located in HSDR. Perturbations towards such easy samples in the target class can avoid density estimation for HSDR location. Based on the above facts, we verified that adding perturbations to easy samples in the target class improves targeted adversarial transferability of existing attack methods. A generative targeted attack strategy named Easy Sample Matching Attack (ESMA) is proposed, which has a higher success rate for targeted attacks and outperforms the SOTA generative method. Moreover, ESMA requires only 5% of the storage space and much less computation time compared to the current SOTA, as ESMA attacks all classes with only one model instead of separate models for each class. Our code is available at https://github.com/gjq100/ESMA.

6/11/2024

$DA^3$: A Distribution-Aware Adversarial Attack against Language Models

Yibo Wang, Xiangjue Dong, James Caverlee, Philip S. Yu

Language models can be manipulated by adversarial attacks, which introduce subtle perturbations to input data. While recent attack methods can achieve a relatively high attack success rate (ASR), we've observed that the generated adversarial examples have a different data distribution compared with the original examples. Specifically, these adversarial examples exhibit reduced confidence levels and greater divergence from the training data distribution. Consequently, they are easy to detect using straightforward detection methods, diminishing the efficacy of such attacks. To address this issue, we propose a Distribution-Aware Adversarial Attack ($DA^3$) method. $DA^3$ considers the distribution shifts of adversarial examples to improve attacks' effectiveness under detection methods. We further design a novel evaluation metric, the Non-detectable Attack Success Rate (NASR), which integrates both ASR and detectability for the attack task. We conduct experiments on four widely used datasets to validate the attack effectiveness and transferability of adversarial examples generated by $DA^3$ against both the white-box BERT-base and RoBERTa-base models and the black-box LLaMA2-7b model.

9/24/2024