Universal Adversarial Perturbations for Vision-Language Pre-trained Models

Read original: arXiv:2405.05524 - Published 5/10/2024 by Peng-Fei Zhang, Zi Huang, Guangdong Bai

Overview

This paper investigates the adversarial robustness of vision-language pre-trained (VLP) models, which are foundational for numerous vision-language tasks.
Traditionally, adversarial perturbations target specific VLP models, datasets, and/or downstream tasks, which limits transferability and incurs additional computation costs.
The authors propose a novel black-box method, the Effective and Transferable Universal Adversarial Attack (ETU), which generates Universal Adversarial Perturbations (UAPs): a single imperceptible perturbation pattern for the image modality that can mislead a variety of existing VLP models across different downstream tasks.
The authors also introduce a data augmentation technique called ScMix to enhance the effectiveness and transferability of the generated UAPs.

Plain English Explanation

Vision-language models are a type of artificial intelligence that can understand and process both visual and textual information. These models have become very important for various applications, such as image captioning, visual question answering, and multimodal retrieval.

However, researchers have found that these models can be vulnerable to adversarial attacks, which are small, intentional changes to the input that can cause the model to make mistakes. Traditionally, researchers have generated these adversarial perturbations for specific models, datasets, or tasks, but this approach has limitations. The perturbations often don't work well when applied to different models or tasks, and generating them can be computationally expensive.

In this paper, the authors propose a new method called Effective and Transferable Universal Adversarial Attack (ETU) to generate Universal Adversarial Perturbations (UAPs). These are perturbations that can mislead a variety of vision-language models across different tasks, without needing to be tailored to each individual model or task.

The key idea behind ETU is to encourage both global and local utility in the UAP: the perturbation should be effective as a whole, and its individual parts should be effective too. The authors report that this improves overall utility while reducing interactions between UAP units, which in turn improves transferability to different models and tasks.

The authors also introduce a data augmentation technique called ScMix that can further enhance the effectiveness and transferability of the generated UAPs. ScMix combines self-mix and cross-mix transformations to increase the diversity of the multimodal data used for UAP generation while preserving the semantics of the original data.

Through extensive experiments, the authors demonstrate that their proposed ETU method, combined with the ScMix data augmentation, can generate highly effective and transferable universal adversarial perturbations for a variety of vision-language models and tasks.

Technical Explanation

The authors propose a novel black-box method called Effective and Transferable Universal Adversarial Attack (ETU) to generate Universal Adversarial Perturbations (UAPs) that can mislead a variety of existing vision-language pre-trained (VLP) models across different downstream tasks.
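
To make the idea of a universal perturbation concrete, here is a minimal sketch of how one shared pattern could be added to every image in a batch. It is an illustration only, not the authors' implementation: the function name `apply_uap`, the L-infinity budget of 8/255, and the random stand-in data are all assumptions.

```python
import numpy as np

def apply_uap(images: np.ndarray, delta: np.ndarray, eps: float = 8 / 255) -> np.ndarray:
    """Add one shared perturbation to every image in a batch.

    images: float array of shape (N, H, W, C) with values in [0, 1].
    delta:  float array of shape (H, W, C); the same pattern for all images.
    eps:    assumed L-infinity budget (hypothetical value, not from the paper).
    """
    delta = np.clip(delta, -eps, eps)          # keep the pattern small
    return np.clip(images + delta, 0.0, 1.0)   # stay in the valid pixel range

# Random stand-ins for real images and a learned perturbation.
rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))
delta = rng.normal(scale=0.1, size=(8, 8, 3))
adv = apply_uap(images, delta)
```

The key property is that `delta` has no batch dimension: it is computed once and reused for every input, which is what distinguishes a universal perturbation from a per-sample one.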

The key components of the ETU method are:

Global and Local Utility: ETU encourages both global and local utility of the UAP, so that the perturbation is effective as a whole and in its individual parts. The authors report that this benefits overall utility while reducing interactions between UAP units, which improves transferability.
Cross-Modal Interactions: ETU takes into account the characteristics of UAPs and the intrinsic cross-modal interactions between the image and text modalities when generating the perturbation.
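
One plausible reading of the two components above can be sketched as a loss that an attacker would minimize. This is a hedged illustration under strong assumptions, not the objective from the paper: the toy linear `embed` function stands in for a real VLP image encoder, the "local" term simply keeps one random patch of the perturbation, and the names `global_local_loss`, `proj`, and `patch` are hypothetical.

```python
import numpy as np

def embed(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Toy stand-in for an image encoder: flatten, project, L2-normalize."""
    z = x.reshape(-1) @ proj
    return z / np.linalg.norm(z)

def global_local_loss(image, text_emb, delta, proj, rng, patch=4):
    """Image-text cosine similarity under the full UAP plus under one random patch of it.

    Lower values mean the perturbed image has drifted away from its paired text.
    """
    h, w, _ = delta.shape
    # Global term: the whole perturbation is applied.
    global_sim = embed(np.clip(image + delta, 0, 1), proj) @ text_emb
    # Local term (assumed form): only a random patch of the perturbation is kept.
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    mask = np.zeros_like(delta)
    mask[top:top + patch, left:left + patch, :] = 1.0
    local_sim = embed(np.clip(image + delta * mask, 0, 1), proj) @ text_emb
    return float(global_sim + local_sim)

# Random stand-ins for an image, its paired text embedding, and a perturbation.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
delta = rng.uniform(-8 / 255, 8 / 255, size=(8, 8, 3))
proj = rng.normal(size=(8 * 8 * 3, 16))
text_emb = rng.normal(size=16)
text_emb /= np.linalg.norm(text_emb)
loss = global_local_loss(image, text_emb, delta, proj, rng)
```

The cross-modal aspect enters through `text_emb`: the perturbation is scored by how far it pushes the image away from its paired text, rather than by a change in image features alone.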

To further enhance the effectiveness and transferability of the UAPs, the authors also propose a novel data augmentation method called ScMix. ScMix consists of:

Self-mix: A transformation that operates within a single sample to increase diversity.
Cross-mix: A transformation that mixes content across samples to create new multimodal samples.
Both transformations are designed to preserve the semantics of the original data.
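
The two mixing steps can be sketched for the image side as follows. The concrete transforms here are assumptions chosen for illustration, since the summary does not specify them: self-mix is taken to be a blend of an image with a flipped copy of itself, cross-mix a blend with a second image at a small weight so that the first image's content still dominates. The function names and the weights `alpha` and `beta` are hypothetical, and text-side mixing is not covered.

```python
import numpy as np

def self_mix(image: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """Blend an image with a horizontally flipped copy of itself (assumed transform)."""
    flipped = image[:, ::-1, :]
    return alpha * image + (1.0 - alpha) * flipped

def cross_mix(image: np.ndarray, other: np.ndarray, beta: float = 0.8) -> np.ndarray:
    """Blend in a second image with a small weight so the first image still dominates."""
    return beta * image + (1.0 - beta) * other

def sc_mix(image, other, alpha=0.7, beta=0.8):
    """Self-mix first, then cross-mix the result with another sample."""
    return cross_mix(self_mix(image, alpha), other, beta)

# Random stand-ins for two (H, W, C) images with values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
other = rng.random((8, 8, 3))
mixed = sc_mix(image, other)
```

Because both steps are convex combinations, the output stays in the valid pixel range, and keeping `beta` well above 0.5 is one simple way to keep the augmented sample close to the original content.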

The authors conduct comprehensive experiments on various downstream tasks, VLP models, and datasets to demonstrate the effectiveness and transferability of their proposed ETU method and ScMix data augmentation.
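
As a rough sketch of how effectiveness might be quantified on an image-to-text retrieval task, the snippet below computes recall@1 and an attack success rate from embeddings. The metric definitions and function names are assumptions for illustration, not the paper's evaluation protocol, and the embeddings are hand-built toy values in which image `i` is paired with text `i`.

```python
import numpy as np

def recall_at_1(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Fraction of images whose top-scoring text is the paired one (same index)."""
    predicted = (image_embs @ text_embs.T).argmax(axis=1)
    return float(np.mean(predicted == np.arange(len(image_embs))))

def attack_success_rate(clean_embs, adv_embs, text_embs) -> float:
    """Fraction of originally correct retrievals that the perturbation breaks."""
    target = np.arange(len(text_embs))
    clean_ok = (clean_embs @ text_embs.T).argmax(axis=1) == target
    adv_ok = (adv_embs @ text_embs.T).argmax(axis=1) == target
    if clean_ok.sum() == 0:
        return 0.0
    return float(np.mean(~adv_ok[clean_ok]))

# Toy embeddings: clean images match their texts exactly; the "attacked"
# embeddings are shifted to the wrong text for every image except the first.
text_embs = np.eye(4)
clean_embs = np.eye(4)
adv_embs = np.roll(np.eye(4), 1, axis=1)
adv_embs[0] = clean_embs[0]

print(recall_at_1(clean_embs, text_embs))                    # 1.0
print(recall_at_1(adv_embs, text_embs))                      # 0.25
print(attack_success_rate(clean_embs, adv_embs, text_embs))  # 0.75
```

Transferability could be probed the same way by generating the perturbation against one model and computing these numbers with embeddings from a different model.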

Critical Analysis

The paper provides a thorough investigation of the adversarial robustness of VLP models and proposes an effective solution to generate universal adversarial perturbations that can be applied across different models and tasks.

One potential limitation is that the authors only evaluate their method on a limited set of VLP models and downstream tasks. It would be valuable to see the performance of ETU and ScMix on a broader range of VLP architectures and applications to further demonstrate the generalizability of the approach.

Additionally, the authors do not discuss the potential real-world implications or ethical considerations of their work. While adversarial attacks can be used to demonstrate model vulnerabilities, they can also be misused for malicious purposes. It would be important for the authors to address these concerns and provide guidance on responsible use of such techniques.

Overall, the paper presents a novel and promising approach to generating universal adversarial perturbations for VLP models. However, further research is needed to fully understand the broader implications and potential applications/misuses of this technology.

Conclusion

This paper introduces a novel method called Effective and Transferable Universal Adversarial Attack (ETU) to generate Universal Adversarial Perturbations (UAPs) that can effectively mislead a variety of existing vision-language pre-trained (VLP) models across different downstream tasks.

The key innovations of the ETU method are the design of UAPs with both global and local utility, as well as the comprehensive consideration of cross-modal interactions between the image and text modalities. The authors also propose a data augmentation technique called ScMix to further enhance the effectiveness and transferability of the generated UAPs.

Through extensive experiments, the authors demonstrate the effectiveness and transferability of their approach. The results underline the importance of assessing the adversarial robustness of VLP models as they become more prevalent, especially in security-critical applications, and contribute to ongoing efforts to understand and mitigate the vulnerabilities of multimodal AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Universal Adversarial Perturbations for Vision-Language Pre-trained Models

Peng-Fei Zhang, Zi Huang, Guangdong Bai

Vision-language pre-trained (VLP) models have been the foundation of numerous vision-language tasks. Given their prevalence, it becomes imperative to assess their adversarial robustness, especially when deploying them in security-crucial real-world applications. Traditionally, adversarial perturbations generated for this assessment target specific VLP models, datasets, and/or downstream tasks. This practice suffers from low transferability and additional computation costs when transitioning to new scenarios. In this work, we thoroughly investigate whether VLP models are commonly sensitive to imperceptible perturbations of a specific pattern for the image modality. To this end, we propose a novel black-box method to generate Universal Adversarial Perturbations (UAPs), which is so called the Effective and Transferable Universal Adversarial Attack (ETU), aiming to mislead a variety of existing VLP models in a range of downstream tasks. The ETU comprehensively takes into account the characteristics of UAPs and the intrinsic cross-modal interactions to generate effective UAPs. Under this regime, the ETU encourages both global and local utilities of UAPs. This benefits the overall utility while reducing interactions between UAP units, improving the transferability. To further enhance the effectiveness and transferability of UAPs, we also design a novel data augmentation method named ScMix. ScMix consists of self-mix and cross-mix data transformations, which can effectively increase the multi-modal data diversity while preserving the semantics of the original data. Through comprehensive experiments on various downstream tasks, VLP models, and datasets, we demonstrate that the proposed method is able to achieve effective and transferrable universal adversarial attacks.

5/10/2024

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Shutao Xia, Ke Xu

Vision-Language Pre-training (VLP) models trained on large-scale image-text pairs have demonstrated unprecedented capability in many practical applications. However, previous studies have revealed that VLP models are vulnerable to adversarial samples crafted by a malicious adversary. While existing attacks have achieved great success in improving attack effect and transferability, they all focus on instance-specific attacks that generate perturbations for each input sample. In this paper, we show that VLP models can be vulnerable to a new class of universal adversarial perturbation (UAP) for all input samples. Although initially transplanting existing UAP algorithms to perform attacks showed effectiveness in attacking discriminative models, the results were unsatisfactory when applied to VLP models. To this end, we revisit the multimodal alignments in VLP model training and propose the Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC). Specifically, we first design a generator that incorporates cross-modal information as conditioning input to guide the training. To further exploit cross-modal interactions, we propose to formulate the training objective as a multimodal contrastive learning paradigm based on our constructed positive and negative image-text pairs. By training the conditional generator with the designed loss, we successfully force the adversarial samples to move away from its original area in the VLP model's feature space, and thus essentially enhance the attacks. Extensive experiments show that our method achieves remarkable attack performance across various VLP models and Vision-and-Language (V+L) tasks. Moreover, C-PGC exhibits outstanding black-box transferability and achieves impressive results in fooling prevalent large VLP models including LLaVA and Qwen-VL.

6/11/2024

Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models

Haonan Zheng, Wen Jiang, Xinyang Deng, Wenrui Li

Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training (VLP) models to subtle yet intentionally designed perturbations in images and texts. Investigating multimodal systems' robustness via adversarial attacks is crucial in this field. Most multimodal attacks are sample-specific, generating a unique perturbation for each sample to construct adversarial samples. To the best of our knowledge, it is the first work through multimodal decision boundaries to explore the creation of a universal, sample-agnostic perturbation that applies to any image. Initially, we explore strategies to move sample points beyond the decision boundaries of linear classifiers, refining the algorithm to ensure successful attacks under the top $k$ accuracy metric. Based on this foundation, in visual-language tasks, we treat visual and textual modalities as reciprocal sample points and decision hyperplanes, guiding image embeddings to traverse text-constructed decision boundaries, and vice versa. This iterative process consistently refines a universal perturbation, ultimately identifying a singular direction within the input space which is exploitable to impair the retrieval performance of VLP models. The proposed algorithms support the creation of global perturbations or adversarial patches. Comprehensive experiments validate the effectiveness of our method, showcasing its data, task, and model transferability across various VLP models and datasets. Code: https://github.com/LibertazZ/MUAP

8/7/2024

Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

Jiwei Guan, Tianyu Ding, Longbing Cao, Lei Pan, Chen Wang, Xi Zheng

Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.

8/27/2024