Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Read original: arXiv:2406.04031 - Published 7/2/2024 by Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Overview

This paper presents a novel approach to "jailbreaking" or bypassing the safety constraints of large vision language models (VLMs) using a bi-modal adversarial prompt.
The researchers demonstrate how to generate prompts that can manipulate VLMs to produce outputs that violate their intended behavior, such as generating explicit or harmful content.
This work builds on prior research on jailbreaking attacks against large language models and visual analysis of jailbreak attacks.

Plain English Explanation

Vision language models (VLMs) are AI systems that can understand and generate both images and text. They are often used for tasks like image captioning, visual question answering, and text-to-image generation. However, these models can also be "jailbroken" - that is, manipulated to produce outputs that violate their intended behavior, such as generating explicit or harmful content.

In this paper, the researchers show how to create special "adversarial prompts" that can jailbreak VLMs. These prompts contain a combination of text and images designed to trick the model into generating unintended outputs. By carefully crafting the prompt, the researchers were able to bypass the safety constraints built into the VLM and get it to produce content that the model's creators never intended.

This work builds on previous research that has explored jailbreaking attacks against large language models and visual analysis of jailbreak attacks. The key insight is that by using both text and images in the prompt, the researchers can more effectively manipulate the VLM and uncover its vulnerabilities.

Technical Explanation

The researchers propose a novel technique for jailbreaking vision language models (VLMs) using a bi-modal adversarial prompt. The approach involves crafting a prompt that combines both text and images in a way that can trick the VLM into generating unintended outputs.

The researchers first conduct a systematic analysis of the VLM's behavior and identify potential vulnerabilities. They then design a bi-modal prompt that exploits these vulnerabilities by carefully selecting the text and images to include. The prompts are optimized using gradient-based techniques to maximize the likelihood of the VLM producing the desired (i.e., unintended) output.

The researchers evaluate their approach on several popular VLM architectures, including CLIP and DALL-E. They demonstrate that their bi-modal adversarial prompts can successfully jailbreak these models, causing them to generate explicit, harmful, or otherwise unintended content.

The key insight of this work is that by combining both text and images in the prompt, the researchers can more effectively manipulate the VLM and uncover its vulnerabilities. This bi-modal approach allows them to bypass the safety constraints built into the model and push it to produce outputs that violate its intended behavior.

Critical Analysis

The researchers provide a thorough and well-executed investigation of jailbreaking vision language models using bi-modal adversarial prompts. The work builds on a strong foundation of prior research in this area and makes a significant contribution to our understanding of the vulnerabilities of these models.

However, it's important to note that the research also raises some important ethical concerns. The ability to jailbreak VLMs and generate harmful or explicit content could potentially be misused by bad actors. The researchers acknowledge this risk and discuss the need for continued research into model safety and robustness.

Additionally, the paper does not fully address the limitations of the proposed approach. For example, it's unclear how well the bi-modal adversarial prompts would generalize to a wider range of VLM architectures or how robust the technique would be to potential countermeasures developed by model developers.

Further research is needed to better understand the broader implications of this work and to explore more effective ways of ensuring the safety and reliability of vision language models. As this technology continues to advance, it will be crucial to prioritize the development of robust safeguards and mechanisms to prevent the misuse of these powerful AI systems.

Conclusion

This paper presents a novel approach to jailbreaking vision language models using bi-modal adversarial prompts. The researchers demonstrate how carefully crafted prompts that combine text and images can be used to bypass the safety constraints of these models and generate unintended outputs.

While this work advances our understanding of the vulnerabilities of VLMs, it also raises important ethical concerns that must be addressed. As these models become increasingly powerful and widespread, it will be critical to develop robust safeguards and mechanisms to ensure they are used responsibly and in alignment with societal values.

Overall, this paper provides a valuable contribution to the ongoing research on the security and reliability of large-scale AI systems. Its findings underscore the need for continued vigilance and innovation in this rapidly evolving field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao

In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing solely visual inputs in the prompt for attacks. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that image prompt LVLMs to respond positively to any harmful queries. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts through a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method significantly outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.

7/2/2024

White-box Multimodal Jailbreaks Against Large Vision-Language Models

Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang

Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methods mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. Different from existing attacks, in this work we propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within VLMs. Specifically, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, known as jailbreaks. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the vulnerability of VLMs and the urgent need for new alignment strategies.

5/29/2024

Efficient LLM-Jailbreaking by Introducing Visual Modality

Zhenxing Niu, Yuyao Sun, Haodong Ren, Haoxuan Ji, Quan Wang, Xiaoke Ma, Gang Hua, Rong Jin

This paper focuses on jailbreaking attacks against large language models (LLMs), eliciting them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreaks that directly orient to LLMs, our approach begins by constructing a multimodal large language model (MLLM) through the incorporation of a visual module into the target LLM. Subsequently, we conduct an efficient MLLM-jailbreak to generate jailbreaking embeddings embJS. Finally, we convert the embJS into text space to facilitate the jailbreaking of the target LLM. Compared to direct LLM-jailbreaking, our approach is more efficient, as MLLMs are more vulnerable to jailbreaking than pure LLM. Additionally, to improve the attack success rate (ASR) of jailbreaking, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class jailbreaking capabilities.

5/31/2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen

The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

4/16/2024