InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Read original: arXiv:2312.01886 - Published 6/27/2024 by Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang

Overview

Large vision-language models (LVLMs) have impressive capabilities in understanding images and generating responses, but are vulnerable to adversarial attacks.
This paper proposes a novel and practical targeted attack scenario where the adversary only knows the vision encoder of the victim LVLM, but not the proprietary prompts or underlying language model.
To address the challenges of cross-prompt and cross-model transferability in this setting, the paper introduces an instruction-tuned targeted attack (InstructTA) approach.

Plain English Explanation

Large vision-language models are AI systems that can both understand images and generate text in response. These models have become highly capable, but they share a weakness: they can be "tricked" by adversarial examples. Adversarial examples are slightly modified inputs (such as images) that cause the model to make mistakes, even though the changes may be imperceptible to a human.

In this paper, the researchers study a targeted attack, where the goal is to make the victim model output a response that is semantically similar to text chosen by the attacker, under a restrictive but realistic threat model. The attacker knows only one part of the victim model: the vision encoder that processes the images. Everything else is hidden, including the language model that generates the text responses and the specific prompts (instructions) that the service feeds to the model.

To overcome this, the researchers developed a technique called InstructTA. The key ideas are:

Reverse engineering the target response: First, they feed the target response (the text they want the victim model to output) into a publicly available text-to-image model, producing a target image whose content matches that text.
Inferring the instruction: They then use a large language model (GPT-4) to guess what kind of instruction or prompt would lead the victim model to generate that target response.
Optimizing the adversarial example: With this information, they can build a "local surrogate model" that shares the vision encoder with the victim model. They then optimize an adversarial image that will make the surrogate model produce features similar to the target image, in the hopes that this will also trick the victim model.
Improving transferability: To further improve the chances of the attack working on the actual victim model, they augment the instruction with paraphrased versions generated by GPT-4.
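The optimization and transferability steps above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: a random linear map `W` stands in for the shared vision encoder, each "instruction" is a hypothetical gating vector that makes the features instruction-aware, and the update is a generic projected sign-gradient (PGD-style) step under an L-infinity budget `eps`. The summary does not state which optimizer or perturbation budget the authors use, so this is not their implementation.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def features(W, x, p):
    """Toy instruction-aware features: encoder output gated by instruction vector p."""
    return [g * h for g, h in zip(p, matvec(W, x))]

def loss_and_grad(W, x_adv, x_tgt, instructions):
    """Squared feature distance averaged over instructions, plus its gradient w.r.t. x_adv."""
    n = len(x_adv)
    total, grad = 0.0, [0.0] * n
    for p in instructions:
        diff = [a - b for a, b in zip(features(W, x_adv, p), features(W, x_tgt, p))]
        total += sum(d * d for d in diff)
        # Gradient of sum_i diff_i^2 with diff_i = p_i * (W x_adv - W x_tgt)_i
        for i, row in enumerate(W):
            coef = 2.0 * p[i] * diff[i]
            for j in range(n):
                grad[j] += coef * row[j]
    k = len(instructions)
    return total / k, [g / k for g in grad]

def attack(W, x_clean, x_tgt, instructions, eps=0.1, step=0.01, iters=100):
    """Projected sign-gradient descent that pulls x_adv's features toward the target's."""
    x_adv = list(x_clean)
    for _ in range(iters):
        _, grad = loss_and_grad(W, x_adv, x_tgt, instructions)
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            v = x_adv[j] - step * sign
            # Stay within eps of the clean image, then within the valid pixel range.
            v = min(max(v, x_clean[j] - eps), x_clean[j] + eps)
            x_adv[j] = min(max(v, 0.0), 1.0)
    return x_adv

random.seed(0)
DIM, FEAT = 16, 8  # toy "image" and feature sizes (illustrative only)
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(FEAT)]
x_clean = [random.random() for _ in range(DIM)]  # image the attacker perturbs
x_tgt = [random.random() for _ in range(DIM)]    # stand-in for the generated target image
# One inferred instruction plus two paraphrases, each as a hypothetical gating vector.
instructions = [[random.uniform(0.5, 1.5) for _ in range(FEAT)] for _ in range(3)]

loss_before, _ = loss_and_grad(W, x_clean, x_tgt, instructions)
x_adv = attack(W, x_clean, x_tgt, instructions)
loss_after, _ = loss_and_grad(W, x_adv, x_tgt, instructions)
print(f"feature distance: {loss_before:.3f} -> {loss_after:.3f}")
```

Averaging the loss over several instruction vectors mirrors the idea of paraphrase augmentation: the perturbation is pushed to work for a family of prompts rather than a single guessed one. In a real setting the linear map would be replaced by the victim's actual vision encoder inside the surrogate model.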

The researchers show that this approach outperforms other adversarial attack methods in terms of both attack success and the ability to transfer the attack to different models and prompts.

Technical Explanation

The key technical components of the InstructTA approach are:

Reverse Image Generation: The researchers use a publicly available text-to-image generative model to convert the target response text into a corresponding target image.
Instruction Inference: They then employ the large language model GPT-4 to infer a reasonable instruction $\boldsymbol{p}'$ that could lead the victim LVLM to generate the target response.
Local Surrogate Model: To overcome the lack of knowledge about the victim LVLM's prompts and underlying language model, the researchers form a local surrogate model that shares the same vision encoder as the victim LVLM. This allows them to extract instruction-aware features of the adversarial image example and the target image.
Optimization: They then minimize the distance between these two sets of features to optimize the adversarial example.
Instruction Tuning: To further improve transferability, the researchers augment the inferred instruction $\boldsymbol{p}'$ with additional instructions paraphrased by GPT-4.
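One way to write the optimization described in the steps above is sketched below. The notation is assumed for illustration and is not taken from the paper: $\boldsymbol{x}$ is the clean image, $\boldsymbol{x}'$ the adversarial image, $\boldsymbol{x}_t$ the generated target image, $g(\cdot, \cdot)$ the surrogate model's instruction-aware feature extractor, $\boldsymbol{p}'_0 = \boldsymbol{p}'$ the inferred instruction, and $\boldsymbol{p}'_1, \dots, \boldsymbol{p}'_K$ its paraphrases. The $\ell_\infty$ budget $\epsilon$ is a common convention for imperceptible perturbations and is also an assumption here, as is the choice of squared Euclidean distance.

$$
\min_{\boldsymbol{x}'} \; \frac{1}{K+1} \sum_{k=0}^{K} \left\| g(\boldsymbol{x}', \boldsymbol{p}'_k) - g(\boldsymbol{x}_t, \boldsymbol{p}'_k) \right\|_2^2
\quad \text{s.t.} \quad \left\| \boldsymbol{x}' - \boldsymbol{x} \right\|_\infty \le \epsilon
$$

With $K = 0$ this reduces to matching features under the single inferred instruction $\boldsymbol{p}'$. The paper's actual loss may differ, for example in which instructions are paired with the target image or in the distance measure used.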

The researchers evaluate their InstructTA approach through extensive experiments and demonstrate its superiority in targeted attack performance and transferability compared to other methods.

Critical Analysis

The researchers address an important and practical challenge in securing large vision-language models against adversarial attacks. Their InstructTA approach is novel and shows promising results, but there are a few potential limitations and areas for further research:

Dependence on Proxy Models: The approach relies on the availability of a public text-to-image model and GPT-4 to reverse engineer the target response and infer the instruction. The performance may be affected if these proxy models are not sufficiently accurate or transferable.
Scalability: The need to build a local surrogate model may limit the scalability of the approach, especially for very large victim models.
Real-World Applicability: The paper focuses on a specific targeted attack scenario, but the practicality and effectiveness of the approach in real-world settings with diverse adversarial goals and constraints remains to be explored.

Overall, the InstructTA approach represents an interesting and valuable contribution to the field of adversarial attacks on large vision-language models. The researchers have identified an important practical challenge and proposed a novel solution, which opens up avenues for further research and development in this area.

Conclusion

This paper proposes a novel InstructTA approach to deliver targeted adversarial attacks on large vision-language models (LVLMs) in a practical setting where the attacker only has access to the victim's vision encoder. The key ideas involve reverse engineering the target response, inferring the instruction, and optimizing an adversarial example using a local surrogate model.

The researchers demonstrate the superiority of their approach in terms of attack performance and transferability, highlighting the importance of addressing practical adversarial challenges in the deployment of powerful AI models. While the approach has some limitations, exposing this weakness is a useful step toward securing LVLMs against malicious attacks and motivates further research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang

Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical targeted attack scenario that the adversary can only know the vision encoder of the victim LVLM, without the knowledge of its prompts (which are often proprietary for service providers and not publicly available) and its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attack, which aims to confuse the LVLM to output a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability. Initially, we utilize a public text-to-image generative model to reverse the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}'$ from the target response. We then form a local surrogate model (sharing the same vision encoder with the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and minimize the distance between these two features to optimize the adversarial example. To further improve the transferability with instruction tuning, we augment the instruction $\boldsymbol{p}'$ with instructions paraphrased from GPT-4. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability. The code is available at https://github.com/xunguangwang/InstructTA.

6/27/2024

Adversarial Prompt Tuning for Vision-Language Models

Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, Jitao Sang

With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs. AdvPT innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in VLMs without the need for extensive parameter training or modification of the model architecture. We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. Comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. These findings open up new possibilities for enhancing the security of VLMs. Our code is available at https://github.com/jiamingzhang94/Adversarial-Prompt-Tuning.

8/20/2024

Revisiting Backdoor Attacks against Large Vision-Language Models

Siyuan Liang, Jiawei Liang, Tianyu Pang, Chao Du, Aishan Liu, Ee-Chien Chang, Xiaochun Cao

Instruction tuning enhances large vision-language models (LVLMs) but raises security risks through potential backdoor attacks due to their openness. Previous backdoor studies focus on enclosed scenarios with consistent training and testing instructions, neglecting the practical domain gaps that could affect attack effectiveness. This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs for the first time, revealing certain limitations of most backdoor strategies in practical scenarios. We quantitatively evaluate the generalizability of six typical backdoor attacks on image caption benchmarks across multiple LVLMs, considering both visual and textual domain offsets. Our findings indicate that attack generalizability is positively correlated with the backdoor trigger's irrelevance to specific images/models and the preferential correlation of the trigger pattern. Additionally, we modify existing backdoor attacks based on the above key observations, demonstrating significant improvements in cross-domain scenario generalizability (+86% attack success rate). Notably, even without access to the instruction datasets, a multimodal instruction set can be successfully poisoned with a very low poisoning rate (0.2%), achieving an attack success rate of over 97%. This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs, necessitating more attention and in-depth research.

7/1/2024

Generative Visual Instruction Tuning

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

We propose to use machine-generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language, and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pre-trained models through instruction finetuning: LLaMA for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities on par with LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.

6/18/2024