Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Read original: arXiv:2407.00389 - Published 7/2/2024 by Chao Zhou, Xiaowen Shi, Yuan-Gen Wang

Overview

The paper presents a query-efficient hard-label black-box attack against Vision Transformers (ViTs), a type of machine learning model used for computer vision tasks.
The proposed attack aims to make a ViT misclassify input images while keeping the number of queries (model evaluations) small.
This matters because black-box attacks, in which the attacker has no access to the model's internal parameters, are more practical in real-world scenarios than white-box attacks.

Plain English Explanation

The researchers have developed a technique to trick Vision Transformer (ViT) models, which are used for computer vision tasks like image classification. Unlike traditional convolutional neural networks (CNNs), which slide filters over the image, a ViT splits the image into fixed-size patches and processes them with a transformer.

The researchers' technique is designed to fool the ViT model into incorrectly classifying an image without knowing the internal details of how the model works. This is known as a "black-box" attack. In the "hard-label" variant studied here, the attacker sees only the class the model predicts, not its confidence scores. Black-box attacks are more realistic in real-world scenarios than "white-box" attacks, where the attacker has full knowledge of the model.

The key innovation of this work is that the researchers' attack method is "query-efficient," meaning it can achieve the desired misclassification by making a relatively small number of queries (or evaluations) of the target ViT model. This is important because in a real-world setting, the number of queries an attacker can make may be limited, for example, due to computational constraints or the model owner's security measures.
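To make the hard-label, limited-query setting concrete, here is a minimal sketch of the interface an attacker would see. The class name `HardLabelOracle`, the `predict_fn` callable, and the toy classifier are all hypothetical stand-ins for illustration, not anything taken from the paper.

```python
import numpy as np


class HardLabelOracle:
    """Exposes only the top-1 label of a classifier and counts queries."""

    def __init__(self, predict_fn, budget):
        self._predict_fn = predict_fn  # hypothetical: maps an image array to a class index
        self.budget = budget           # maximum number of queries allowed
        self.queries = 0

    def label(self, image):
        # Refuse to answer once the budget is spent, as a rate-limited API might.
        if self.queries >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.queries += 1
        return int(self._predict_fn(image))


def toy_model(image):
    # Stand-in classifier: class 1 if the mean pixel value exceeds 0.5, else class 0.
    return 1 if image.mean() > 0.5 else 0


oracle = HardLabelOracle(toy_model, budget=3)
dark = np.zeros((8, 8))
bright = np.ones((8, 8))
labels = [oracle.label(dark), oracle.label(bright)]
```

Every call to `label` spends one unit of the budget, so an attack's efficiency can be judged by how much it achieves before `queries` reaches `budget`.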

Technical Explanation

The paper introduces a query-efficient hard-label black-box attack against Vision Transformers (ViTs). The proposed attack, called AdvViT, aims to fool the ViT model into misclassifying an input image while keeping the number of queries (or image evaluations) within a limited budget.

According to the paper's abstract, the attack exploits the observation that ViTs are highly sensitive to patch modification: it optimizes the adversarial perturbation on individual patches, without accessing the model's internal parameters. To reduce the dimension of the search space, it modifies only a handful of low-frequency components of each patch. A weight mask matrix over all patches further tunes how much perturbation is placed in different regions of the image. The only feedback the attack uses is the hard classification label returned by the target ViT model.
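The paper's abstract describes perturbing only a few low-frequency components of each patch and scaling patches with a weight mask. The sketch below illustrates that idea under explicit assumptions: it uses a discrete cosine transform (DCT) as the frequency basis, a single-channel image, and one scalar weight per patch. The abstract does not specify any of these details, and the function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows index frequencies)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


def patch_perturbation(coeffs, patch_weights, patch_size):
    """Build a pixel-space perturbation from per-patch low-frequency coefficients.

    coeffs: array of shape (rows, cols, k, k), the k x k lowest DCT coefficients per patch.
    patch_weights: array of shape (rows, cols), an assumed per-patch weight mask.
    """
    rows, cols, k, _ = coeffs.shape
    basis = dct_matrix(patch_size)
    delta = np.zeros((rows * patch_size, cols * patch_size))
    for r in range(rows):
        for c in range(cols):
            full = np.zeros((patch_size, patch_size))
            full[:k, :k] = coeffs[r, c]      # all higher frequencies stay zero
            patch = basis.T @ full @ basis   # inverse 2-D DCT back to pixels
            ys, xs = r * patch_size, c * patch_size
            delta[ys:ys + patch_size, xs:xs + patch_size] = patch_weights[r, c] * patch
    return delta


rng = np.random.default_rng(0)
coeffs = rng.normal(size=(2, 2, 2, 2))           # 2 x 2 patches, 2 x 2 coefficients each
weights = np.array([[1.0, 0.0], [0.5, 2.0]])     # a zero weight leaves that patch untouched
delta = patch_perturbation(coeffs, weights, patch_size=4)
```

In this toy setup the search runs over 16 coefficients instead of 64 pixels, which shows why restricting the perturbation to low frequencies shrinks the search space.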

The researchers evaluate AdvViT on six mainstream ViT backbones using the ImageNet-1k dataset and compare it with state-of-the-art attacks designed for CNNs. The reported results show that AdvViT achieves much lower $L_2$-norm distortion under the same query budget.
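The abstract reports results as $L_2$-norm distortion under a fixed query budget. As a rough illustration of how distortion and query count trade off in the hard-label setting, the sketch below uses a generic building block from the hard-label attack literature: a binary search along the line between the original image and a known adversarial image. This is not the paper's algorithm, and `is_adversarial`, `shrink_to_boundary`, and the toy decision rule are hypothetical.

```python
import numpy as np


def l2_distortion(original, perturbed):
    """L2 norm of the difference between two images."""
    return float(np.linalg.norm((perturbed - original).ravel()))


def shrink_to_boundary(is_adversarial, original, adversarial, steps=10):
    """Binary-search the blend between `original` and `adversarial`.

    is_adversarial: callable returning True when the model's hard label is wrong.
    Each call costs one query. Returns the closest adversarial image found
    and the number of queries used.
    """
    low, high = 0.0, 1.0  # blend 0 is the original, blend 1 is the adversarial image
    queries = 0
    for _ in range(steps):
        mid = (low + high) / 2.0
        candidate = (1.0 - mid) * original + mid * adversarial
        queries += 1
        if is_adversarial(candidate):
            high = mid    # still adversarial, so move closer to the original
        else:
            low = mid     # crossed the boundary, so back off
    best = (1.0 - high) * original + high * adversarial
    return best, queries


def is_adversarial(image):
    # Toy decision rule standing in for "the model outputs the wrong label".
    return image.mean() > 0.5


original = np.zeros((4, 4))
adversarial = np.ones((4, 4))
best, queries = shrink_to_boundary(is_adversarial, original, adversarial, steps=10)
```

Each extra search step costs one query and roughly halves the remaining gap to the decision boundary, so comparing attacks at the same query budget amounts to asking which one reaches the smaller `l2_distortion`.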

Critical Analysis

The paper presents a well-designed and thoroughly evaluated attack method against Vision Transformers. The key strength of the work is the query-efficient nature of the proposed attack, which is an important practical consideration for real-world black-box attacks.

However, the paper does not address the potential ethical concerns and societal implications of such attacks. While the research aims to better understand the vulnerabilities of ViT models, the techniques could also be misused by bad actors to compromise the security of machine learning systems in harmful ways. The authors could have included a more extensive discussion of these issues and the importance of developing robust defenses against black-box attacks.

Additionally, the paper focuses solely on attacking ViT models and does not investigate the broader applicability of the AdvViT method to other types of machine learning models. It would be interesting to see how the technique performs against other classifier architectures, such as convolutional neural networks.

Conclusion

The paper presents a novel query-efficient hard-label black-box attack against Vision Transformers, a type of machine learning model used for computer vision tasks. According to the abstract, the proposed AdvViT method fools ViT models with much lower $L_2$-norm distortion than state-of-the-art attacks designed for CNNs under the same query budget.

While the technical merits of the work are strong, the authors could have provided a more comprehensive discussion of the ethical implications and potential misuse of such attack techniques. Additionally, investigating the broader applicability of the AdvViT method to other machine learning models would be a valuable area for future research.

Overall, this work contributes to our understanding of the vulnerabilities of Vision Transformers and the importance of developing robust defenses against black-box attacks in real-world machine learning systems.

This summary was produced with help from an AI and may contain inaccuracies; refer to the original paper (arXiv:2407.00389) for details.

Related Papers

Query-Efficient Hard-Label Black-Box Attack against Vision Transformers

Chao Zhou, Xiaowen Shi, Yuan-Gen Wang

Recent studies have revealed that vision transformers (ViTs) face similar security risks from adversarial attacks as deep convolutional neural networks (CNNs). However, directly applying attack methodology on CNNs to ViTs has been demonstrated to be ineffective since the ViTs typically work on patch-wise encoding. This article explores the vulnerability of ViTs against adversarial attacks under a black-box scenario, and proposes a novel query-efficient hard-label adversarial attack method called AdvViT. Specifically, considering that ViTs are highly sensitive to patch modification, we propose to optimize the adversarial perturbation on the individual patches. To reduce the dimension of perturbation search space, we modify only a handful of low-frequency components of each patch. Moreover, we design a weight mask matrix for all patches to further optimize the perturbation on different regions of a whole image. We test six mainstream ViT backbones on the ImageNet-1k dataset. Experimental results show that compared with the state-of-the-art attacks on CNNs, our AdvViT achieves much lower $L_2$-norm distortion under the same query budget, sufficiently validating the vulnerability of ViTs against adversarial attacks.

7/2/2024

Multi-Attribute Vision Transformers are Efficient and Robust Learners

Hanan Gani, Nada Saadi, Noor Hussein, Karthik Nandakumar

Since their inception, Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) across a wide spectrum of tasks. ViTs exhibit notable characteristics, including global attention, resilience against occlusions, and adaptability to distribution shifts. One underexplored aspect of ViTs is their potential for multi-attribute learning, referring to their ability to simultaneously grasp multiple attribute-related tasks. In this paper, we delve into the multi-attribute learning capability of ViTs, presenting a straightforward yet effective strategy for training various attributes through a single ViT network as distinct tasks. We assess the resilience of multi-attribute ViTs against adversarial attacks and compare their performance against ViTs designed for single attributes. Moreover, we further evaluate the robustness of multi-attribute ViTs against a recent transformer based attack called Patch-Fool. Our empirical findings on the CelebA dataset provide validation for our assertion. Our code is available at https://github.com/hananshafi/MTL-ViT

7/22/2024

AViT: Adapting Vision Transformers for Small Skin Lesion Segmentation Datasets

Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi

Skin lesion segmentation (SLS) plays an important role in skin lesion analysis. Vision transformers (ViTs) are considered an auspicious solution for SLS, but they require more training data compared to convolutional neural networks (CNNs) due to their inherent parameter-heavy structure and lack of some inductive biases. To alleviate this issue, current approaches fine-tune pre-trained ViT backbones on SLS datasets, aiming to leverage the knowledge learned from a larger set of natural images to lower the amount of skin training data needed. However, fully fine-tuning all parameters of large backbones is computationally expensive and memory intensive. In this paper, we propose AViT, a novel efficient strategy to mitigate ViTs' data-hunger by transferring any pre-trained ViTs to the SLS task. Specifically, we integrate lightweight modules (adapters) within the transformer layers, which modulate the feature representation of a ViT without updating its pre-trained weights. In addition, we employ a shallow CNN as a prompt generator to create a prompt embedding from the input image, which grasps fine-grained information and CNN's inductive biases to guide the segmentation task on small datasets. Our quantitative experiments on 4 skin lesion datasets demonstrate that AViT achieves competitive, and at times superior, performance to SOTA but with significantly fewer trainable parameters. Our code is available at https://github.com/siyi-wind/AViT.

6/13/2024

Towards Robust Vision Transformer via Masked Adaptive Ensemble

Fudong Lin, Jiadong Lou, Xu Yuan, Nian-Feng Tzeng

Adversarial training (AT) can help improve the robustness of Vision Transformers (ViT) against adversarial attacks by intentionally injecting adversarial examples into the training data. However, this way of adversarial injection inevitably incurs standard accuracy degradation to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, the prominent AT solutions are still vulnerable to adaptive attacks. To tackle such shortcomings, this paper proposes a novel ViT architecture, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, we empirically discover that detecting adversarial examples can benefit from the Guided Backpropagation technique. Driven by this discovery, a novel Multi-head Self-Attention (MSA) mechanism is introduced to enhance our detector to sniff adversarial examples. Then, a classifier with two encoders is employed for extracting visual representations respectively from clean images and adversarial examples, with our adaptive ensemble to adaptively adjust the proportion of visual representations from the two encoders for accurate classification. This design enables our ViT architecture to achieve a better trade-off between standard accuracy and robustness. Besides, our adaptive ensemble technique allows us to mask off a random subset of image patches within input data, boosting our ViT's robustness against adaptive attacks, while maintaining high standard accuracy. Experimental results exhibit that our ViT architecture, on CIFAR-10, achieves the best standard accuracy and adversarial robustness of 90.3% and 49.8%, respectively.

7/23/2024