When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Read original: arXiv:2407.15211 - Published 7/23/2024 by Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes and 5 others

Overview

New AI models that can process both images and text offer exciting capabilities, but also raise concerns about potential vulnerabilities to adversarial attacks.
This study focuses on a class of AI models called vision-language models (VLMs), which generate text outputs based on visual and textual inputs.
The researchers conducted a large-scale investigation to assess the transferability of gradient-based "jailbreak" attacks on over 40 open-parameter VLMs.

Plain English Explanation

The integration of visual and language processing capabilities into AI systems has led to the development of powerful vision-language models that can generate text outputs based on both images and text. While these models offer exciting new capabilities, there are concerns that they may be vulnerable to adversarial attacks that could manipulate their behavior in undesirable ways.

In this study, the researchers conducted a large-scale investigation of how well "jailbreak" attacks that work on one vision-language model transfer to others. A jailbreak is an input crafted to bypass a model's safety guardrails so that it produces content it would otherwise refuse to generate; here the input is a single adversarial image, optimized with gradients, that is meant to work across many prompts. The researchers tested transferability across a diverse set of over 40 open-parameter VLMs, including 18 new models that they released publicly.

The key finding is that it is extremely difficult to obtain transferable gradient-based jailbreak attacks against these VLMs. When a jailbreak attack is optimized against a single VLM or an ensemble of VLMs, it successfully jailbreaks the attacked model(s), but exhibits little to no transfer to any other VLMs. This lack of transferability holds true regardless of factors like whether the models share the same vision or language components, or whether the language model has undergone specialized training for safety and instruction-following.

Technical Explanation

The researchers conducted a large-scale empirical study to assess the transferability of gradient-based universal image jailbreaks across a diverse set of over 40 open-parameter vision-language models (VLMs), including 18 new VLMs that they publicly released.

They found that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs. This lack of transfer is not affected by factors such as whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other characteristics.
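The optimization described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: each "VLM" is replaced by a hypothetical linear scorer whose logit stands in for the log-odds of a harmful target response, each prompt is reduced to a scalar bias, and the names `target_nll`, `optimize_universal_image`, and all numeric values are invented for the example. A single image is updated by projected gradient descent on the loss averaged over every (model, prompt) pair, which is what makes it universal across prompts and an ensemble attack across models.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def target_nll(weights, prompt_bias, image):
    """Toy stand-in for a VLM loss: -log p(target response | image, prompt)."""
    logit = sum(w * x for w, x in zip(weights, image)) + prompt_bias
    # -log(sigmoid(logit)), written in a form that cannot overflow
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


def nll_gradient(weights, prompt_bias, image):
    """Gradient of target_nll with respect to each pixel."""
    logit = sum(w * x for w, x in zip(weights, image)) + prompt_bias
    scale = -sigmoid(-logit)  # derivative of -log(sigmoid(logit)) w.r.t. logit
    return [scale * w for w in weights]


def mean_nll(models, prompt_biases, image):
    """Average loss over every (model, prompt) pair."""
    losses = [target_nll(w, b, image) for w in models for b in prompt_biases]
    return sum(losses) / len(losses)


def optimize_universal_image(models, prompt_biases, init_image, steps=300, lr=0.1):
    """Projected gradient descent on one image shared by all models and prompts."""
    image = list(init_image)
    pairs = [(w, b) for w in models for b in prompt_biases]
    for _ in range(steps):
        grad = [0.0] * len(image)
        for w, b in pairs:
            for i, g in enumerate(nll_gradient(w, b, image)):
                grad[i] += g / len(pairs)
        # Gradient step, then clip so the result stays a valid image in [0, 1].
        image = [min(1.0, max(0.0, x - lr * g)) for x, g in zip(image, grad)]
    return image


# Hypothetical toy data: three attacked "models", one held-out model, three prompts.
rng = random.Random(0)
attacked = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(3)]
held_out = [[rng.uniform(-1, 1) for _ in range(16)]]
prompts = [-3.0, -2.0, -1.0]
start = [0.5] * 16  # uniform gray starting image

image = optimize_universal_image(attacked, prompts, start)
loss_before = mean_nll(attacked, prompts, start)
loss_after = mean_nll(attacked, prompts, image)
held_out_after = mean_nll(held_out, prompts, image)
print(f"attacked: {loss_before:.3f} -> {loss_after:.3f}; held-out: {held_out_after:.3f}")
```

The loss on the attacked models falls because the image is optimized against them directly. The held-out number from random linear scorers carries no empirical meaning; it only marks where a transfer measurement would sit in a real experiment.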

The researchers did identify two settings where they observed partially successful transfer:

Between identically-pretrained and identically-initialized VLMs with slightly different VLM training data.
Between different training checkpoints of a single VLM.

Leveraging these insights, the researchers demonstrated that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of highly-similar VLMs.
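Claims about transfer, including the ensemble result above, rest on comparing how a jailbreak image scores on the models it was optimized against with how it scores on models it never saw. The bookkeeping can be sketched as follows. The helper names `transfer_matrix` and `summarize_transfer` are hypothetical, the score is assumed to be a loss where lower means a stronger jailbreak, and the demo replaces models and images with plain numbers purely so the example runs.

```python
from statistics import mean


def transfer_matrix(attacks, models, loss_fn):
    """Score every jailbreak image against every model.

    attacks: {attack_name: (image, set of attacked model names)}
    models:  {model_name: model}
    loss_fn: callable(model, image) -> float, lower means a stronger jailbreak
    """
    return {
        name: {m: loss_fn(model, image) for m, model in models.items()}
        for name, (image, _) in attacks.items()
    }


def summarize_transfer(attacks, matrix):
    """Split each row into attacked models and held-out models."""
    summary = {}
    for name, (_, attacked_names) in attacks.items():
        row = matrix[name]
        seen = [v for m, v in row.items() if m in attacked_names]
        unseen = [v for m, v in row.items() if m not in attacked_names]
        summary[name] = {
            "attacked": mean(seen),
            "held_out": mean(unseen) if unseen else None,
        }
    return summary


def toy_loss(model, image):
    """Placeholder loss: models and images are plain numbers in this demo."""
    return (model - image) ** 2


models = {"A": 1.0, "B": 2.0, "C": 5.0}
attacks = {
    "single": (1.0, {"A"}),          # image optimized against model A only
    "ensemble": (1.5, {"A", "B"}),   # image optimized against A and B
}
matrix = transfer_matrix(attacks, models, toy_loss)
summary = summarize_transfer(attacks, matrix)
print(summary)
```

The gap between the "attacked" and "held_out" columns is the quantity of interest: a small gap means the image transfers, a large gap means it does not. The placeholder numbers say nothing about real VLMs.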

These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

Critical Analysis

The researchers provide a thorough and rigorous empirical investigation of the transferability of gradient-based jailbreak attacks against a diverse set of vision-language models. Their findings suggest that these models may be more resilient to transfer attacks than language models and image classifiers, which is an important insight for the field. The resilience concerns transfer only: an image optimized directly against a given VLM still jailbreaks that VLM.

However, the study is limited to a specific type of attack (gradient-based) and a particular class of AI models (VLMs). It would be valuable to explore the transferability of other forms of jailbreak attacks, such as those that leverage different optimization techniques or exploit different vulnerabilities. Additionally, the researchers note that their findings may not generalize to other multimodal AI systems beyond VLMs.

Further research is needed to fully understand the security and robustness of these emerging AI technologies. Investigating other attack vectors, exploring the underlying reasons for the observed lack of transferability, and expanding the analysis to a broader range of multimodal models could provide valuable insights for developing more secure and reliable AI systems.

Conclusion

This study provides a substantial contribution to our understanding of the security of vision-language models (VLMs), a rapidly advancing class of AI systems that can process both visual and textual inputs. The researchers conducted a large-scale empirical investigation and found that transferable gradient-based image jailbreaks are extremely difficult to obtain against VLMs, in contrast to existing evidence of transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers.

These results suggest that VLMs may be more robust to certain types of adversarial manipulation, which has important implications for the continued development and deployment of these powerful AI systems. As the integration of visual and language processing capabilities in AI becomes more prevalent, understanding and addressing potential vulnerabilities will be crucial to ensuring the safety and reliability of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez

The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image jailbreaks using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of highly-similar VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

7/23/2024

White-box Multimodal Jailbreaks Against Large Vision-Language Models

Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang

Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies.

5/29/2024

Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

Xiaotian Zou, Ke Li, Yongkang Chen

Large Visual Language Models (VLMs) such as GPT-4V have achieved remarkable success in generating comprehensive and nuanced responses. Researchers have proposed various benchmarks for evaluating the capabilities of VLMs. With the integration of visual and text inputs in VLMs, new security issues emerge, as malicious attackers can exploit multiple modalities to achieve their objectives. This has led to increasing attention on the vulnerabilities of VLMs to jailbreaks. Most existing research focuses on generating adversarial or nonsensical images to jailbreak these models. However, no research has evaluated whether the logic understanding capabilities of VLMs on flowcharts can influence jailbreaks. Therefore, to fill this gap, this paper first introduces a novel dataset, Flow-JD, specifically designed to evaluate the logic-based flowchart jailbreak capabilities of VLMs. We conduct an extensive evaluation on GPT-4o, GPT-4V, and 5 other SOTA open-source VLMs, and the jailbreak rate is up to 92.8%. Our research reveals significant vulnerabilities in current VLMs concerning image-to-text jailbreaks, and these findings underscore the urgency for the development of robust and effective future defenses.

8/28/2024

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao

In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing solely visual inputs in the prompt for attacks. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that the image prompts LVLMs to respond positively to any harmful queries. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts in a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method significantly outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.

7/2/2024