MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification

Read original: arXiv:2406.13066 - Published 6/21/2024 by Harrison Gietz, Jugal Kalita

Overview

This paper introduces MaskPure, a defense against text adversaries based on stochastic purification.
The key idea is to randomly mask portions of the input text and refill them before classification, which "purifies" the input by overwriting potentially harmful perturbations.
The authors report that MaskPure matches or exceeds the robustness of other contemporary defenses against both character-level and word-level attacks, without adversarial training of the classifier and without assuming knowledge of the attack type.

Plain English Explanation

The paper introduces a technique called MaskPure that helps protect machine learning models from malicious text-based attacks. Text adversaries are bad actors who try to fool models by making small, hard-to-notice changes to the input text, like swapping out a word or adding a typo. These changes can cause the model to make incorrect predictions, which is a serious problem in real-world applications.

MaskPure works by randomly "masking" (hiding) parts of the input text and then refilling the hidden positions before the text is passed to the classifier. This acts as a kind of "purification" that can overwrite the modifications made by text adversaries. The authors report that this technique matches or exceeds the robustness of other contemporary defenses against both character-level and word-level attacks.

The key insight is that by randomly obscuring parts of the input, MaskPure makes it much harder for attackers to reliably insert their malicious changes. Even if they manage to sneak in a perturbation, the model is likely to see a different, "purified" version of the text thanks to the masking. This helps the model stay robust and make accurate predictions, even in the face of adversarial attacks.

Technical Explanation

The paper introduces MaskPure, a stochastic purification defense inspired by the noising and de-noising process of diffusion models. The core idea is to randomly mask portions of the input text and refill them before feeding the result into the target classifier. This mask-and-refill step acts as a form of "purification," removing or obscuring adversarial perturbations that have been introduced.
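The mask-and-refill step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `mask_rate` parameter, and the `toy_fill` placeholder are all hypothetical, and a real system would presumably use a masked language model to propose replacement tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate, rng):
    """Return a copy of tokens with a random subset replaced by MASK."""
    if not tokens:
        return []
    # Mask at least one token so the purification step always does something.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def refill(tokens, fill_fn):
    """Replace every MASK with whatever fill_fn proposes for that position."""
    return [fill_fn(tokens, i) if tok == MASK else tok
            for i, tok in enumerate(tokens)]

def toy_fill(tokens, i):
    # Hypothetical stand-in for a masked language model, which would
    # predict a plausible token from the surrounding context.
    return "<filled>"

def purify(text, mask_rate=0.3, seed=None):
    """One stochastic purification pass: whitespace-tokenize, mask, refill."""
    rng = random.Random(seed)
    tokens = text.split()
    return " ".join(refill(mask_tokens(tokens, mask_rate, rng), toy_fill))

if __name__ == "__main__":
    print(purify("the movie was surprisingly good and the cast was great", seed=1))
```

Because the masked positions are drawn at random, an adversarially modified token has some chance of being overwritten on each pass, and the attacker cannot know in advance which positions will survive.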

The authors report that MaskPure exceeds or matches the robustness of other contemporary defenses, while requiring no adversarial training of the classifier and assuming no knowledge of the attack type. They also show that MaskPure is certifiably robust, so the method offers both empirical and theoretical robustness guarantees.

The key technical insight is that by randomly masking parts of the input, MaskPure makes it much harder for attackers to reliably insert their malicious changes. Even if an adversary manages to introduce a perturbation, the model is likely to see a different, "purified" version of the text due to the stochastic masking. This helps the model stay robust and make accurate predictions, even in the face of text-based adversarial attacks.
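One way to exploit this randomness is to classify several independently purified copies of the input and combine the predictions. The sketch below uses a simple majority vote, which is an assumption made for illustration and is not confirmed by this summary as the paper's aggregation rule. `toy_purify`, `toy_classify`, and `n_copies` are hypothetical names.

```python
import random
from collections import Counter

def predict_with_purification(text, purify_fn, classify_fn, n_copies=5, seed=0):
    """Classify n_copies purified versions of text and return the majority label."""
    rng = random.Random(seed)
    votes = Counter(classify_fn(purify_fn(text, rng)) for _ in range(n_copies))
    # With two labels, an odd n_copies avoids ties.
    return votes.most_common(1)[0][0]

def toy_purify(text, rng, mask_rate=0.3):
    # Hypothetical stand-in: masks random tokens and skips the refill step.
    return " ".join("[MASK]" if rng.random() < mask_rate else tok
                    for tok in text.split())

def toy_classify(text):
    # Hypothetical keyword classifier standing in for the target model.
    words = text.lower().split()
    pos = sum(w in {"good", "great"} for w in words)
    neg = sum(w in {"bad", "awful"} for w in words)
    return "positive" if pos >= neg else "negative"

if __name__ == "__main__":
    print(predict_with_purification("a good and great film", toy_purify, toy_classify))
```

A perturbation has to survive masking in most of the copies to flip the final label, which is the intuition behind combining stochastic purification with voting.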

The authors evaluate MaskPure against both character-level and word-level attacks. To their knowledge, it is the first stochastic-purification method with demonstrated success against both attack types.

Critical Analysis

The paper presents a compelling and well-designed defense mechanism against text-based adversarial attacks. The core idea of using stochastic masking as a form of input purification is both elegant and effective, as demonstrated by the strong empirical results.

One potential limitation is that the masking process may introduce some information loss, which could impact the model's overall performance on clean, non-adversarial inputs. The authors acknowledge this tradeoff and suggest that the masking parameters can be tuned to find the right balance between robustness and clean-input accuracy.

Additionally, since MaskPure requires no adversarial training of the classifier, an open question is how it would combine with adversarial training or other defenses. On the theoretical side, the authors do address certified robustness, showing that MaskPure is certifiably robust.

Overall, the MaskPure method represents a significant advancement in the field of text-based adversarial robustness, and the paper provides a solid foundation for future research in this important area.

Conclusion

The MaskPure paper introduces a defense mechanism against text-based adversarial attacks. By randomly masking and refilling portions of the input text, the method acts as a form of "purification," removing or obscuring harmful perturbations introduced by attackers. The authors report that MaskPure matches or exceeds the robustness of other contemporary defenses and is also certifiably robust.

This work represents an important step forward in the ongoing battle against text-based adversarial attacks, which pose a serious threat to the deployment of language models in real-world applications. The insights and techniques presented in this paper will likely inspire further research and development in this critical area of machine learning security and robustness.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification

Harrison Gietz, Jugal Kalita

The improvement of language model robustness, including successful defense against adversarial attacks, remains an open problem. In computer vision settings, the stochastic noising and de-noising process provided by diffusion models has proven useful for purifying input images, thus improving model robustness against adversarial attacks. Similarly, some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting, but improving the quality and efficiency of these methods is necessary for them to remain competitive. We extend upon methods of input text purification that are inspired by diffusion processes, which randomly mask and refill portions of the input text before classification. Our novel method, MaskPure, exceeds or matches robustness compared to other contemporary defenses, while also requiring no adversarial classifier training and without assuming knowledge of the attack type. In addition, we show that MaskPure is provably certifiably robust. To our knowledge, MaskPure is the first stochastic-purification method with demonstrated success against both character-level and word-level attacks, indicating the generalizable and promising nature of stochastic denoising defenses. In summary: the MaskPure algorithm bridges literature on the current strongest certifiable and empirical adversarial defense methods, showing that both theoretical and practical robustness can be obtained together. Code is available on GitHub at https://github.com/hubarruby/MaskPure.

6/21/2024

LightPure: Realtime Adversarial Image Purification for Mobile Devices Using Diffusion Models

Hossein Khalili, Seongbin Park, Vincent Li, Brandan Bright, Ali Payani, Ramana Rao Kompella, Nader Sehatbakhsh

Autonomous mobile systems increasingly rely on deep neural networks for perception and decision-making. While effective, these systems are vulnerable to adversarial machine learning attacks where minor input perturbations can significantly impact outcomes. Common countermeasures involve adversarial training and/or data or network transformation. These methods, though effective, require full access to typically proprietary classifiers and are costly for large models. Recent solutions propose purification models, which add a purification layer before classification, eliminating the need to modify the classifier directly. Despite their effectiveness, these methods are compute-intensive, making them unsuitable for mobile systems where resources are limited and low latency is essential. This paper introduces LightPure, a new method that enhances adversarial image purification. It improves the accuracy of existing purification methods and provides notable enhancements in speed and computational efficiency, making it suitable for mobile devices with limited resources. Our approach uses a two-step diffusion and one-shot Generative Adversarial Network (GAN) framework, prioritizing latency without compromising robustness. We propose several new techniques to achieve a reasonable balance between classification accuracy and adversarial robustness while maintaining desired latency. We design and implement a proof-of-concept on a Jetson Nano board and evaluate our method using various attack scenarios and datasets. Our results show that LightPure can outperform existing methods by up to 10x in terms of latency while achieving higher accuracy and robustness for various attack scenarios. This method offers a scalable and effective solution for real-world mobile systems.

9/4/2024

ZeroPur: Succinct Training-Free Adversarial Purification

Xiuli Bi, Zonglin Yang, Bo Liu, Xiaodong Cun, Chi-Man Pun, Pietro Lio, Bin Xiao

Adversarial purification is a kind of defense technique that can defend various unseen adversarial attacks without modifying the victim classifier. Existing methods often depend on external generative models or cooperation between auxiliary functions and victim classifiers. However, retraining generative models, auxiliary functions, or victim classifiers relies on the domain of the fine-tuned dataset and is computation-consuming. In this work, we suppose that adversarial images are outliers of the natural image manifold and the purification process can be considered as returning them to this manifold. Following this assumption, we present a simple adversarial purification method without further training to purify adversarial images, called ZeroPur. ZeroPur contains two steps: given an adversarial example, Guided Shift obtains the shifted embedding of the adversarial example by the guidance of its blurred counterparts; after that, Adaptive Projection constructs a directional vector by this shifted embedding to provide momentum, projecting adversarial images onto the manifold adaptively. ZeroPur is independent of external models and requires no retraining of victim classifiers or auxiliary functions, relying solely on victim classifiers themselves to achieve purification. Extensive experiments on three datasets (CIFAR-10, CIFAR-100, and ImageNet-1K) using various classifier architectures (ResNet, WideResNet) demonstrate that our method achieves state-of-the-art robust performance. The code will be publicly available.

6/6/2024

DiffuseDef: Improved Robustness to Adversarial Attacks

Zhenhao Li, Marek Rei, Lucia Specia

Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over different existing adversarial defense methods and achieves state-of-the-art performance against common adversarial attacks.

7/2/2024