Adversarial Illusions in Multi-Modal Embeddings

Read original: arXiv:2308.11804 - Published 6/18/2024 by Tingwei Zhang, Rishi Jha, Eugene Bagdasaryan, Vitaly Shmatikov


Overview

This paper explores a new type of attack called "adversarial illusions" that can compromise multi-modal AI systems.
Multi-modal embeddings encode different types of data, like text, images, and audio, into a single representation space, allowing associations between modalities.
The paper shows that an adversary can deliberately perturb an image or sound to make its embedding align with an arbitrary target in another modality.
These cross-modal, targeted attacks can undermine a wide range of downstream tasks, such as image generation, text generation, zero-shot classification, and audio retrieval, even without knowledge of the specific task.

Plain English Explanation

Multi-modal AI systems can understand and work with different types of data, like text, images, and audio. These systems map all the different data types into a shared embedding space, which lets them make connections between modalities: for example, linking an image of a dog to the sound of a dog barking.

This paper shows that these multi-modal AI systems can be vulnerable to a new type of attack called "adversarial illusions." An attacker can deliberately modify an image or sound in a way that tricks the system into thinking it's associated with something completely different. For instance, the attacker could alter an image to make the AI think it's associated with a particular sound, even though the original image and sound had no connection.

These adversarial illusions are cross-modal: they link data across different modalities. They are also targeted: the attacker chooses the specific target to align the image or sound with. Because the attack manipulates the shared embedding rather than any one application, it can undermine all sorts of tasks, from image generation to text generation, without the attacker even knowing the details of the task.

The paper demonstrates these adversarial illusions on existing multi-modal AI models and shows they can work even in a "black-box" setting, where the attacker cannot inspect the model's internals; the researchers use this setting to attack Amazon's proprietary Titan embedding. They also analyze possible countermeasures and how an attacker might evade them.

Technical Explanation

The paper explores a new attack called "adversarial illusions" that can compromise multi-modal AI systems. These systems encode data from different modalities, like text, images, thermal images, sounds, and videos, into a shared embedding space. This allows the systems to make connections between different types of data, like associating an image of a dog with the sound of a dog barking.
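To make the shared-space idea concrete, here is a minimal sketch of how one downstream task, zero-shot classification, can consume such embeddings: it simply picks the text label whose embedding has the highest cosine similarity to the image embedding. The 3-dimensional vectors and label strings are made-up placeholders, not real ImageBind or AudioCLIP outputs, and `zero_shot_classify` is a hypothetical helper. The point is that the task only ever sees the embedding, so anything that moves an image's embedding toward a different label changes the answer.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb: np.ndarray, label_embs: np.ndarray, labels: list):
    """Return the label whose embedding is closest (by cosine) to the image embedding."""
    sims = normalize(label_embs) @ normalize(image_emb)
    return labels[int(np.argmax(sims))], sims

# Hypothetical toy embeddings living in one shared space.
labels = ["dog barking", "car horn", "birdsong"]
label_embs = np.array([[1.0, 0.1, 0.0],
                       [0.0, 1.0, 0.1],
                       [0.1, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.05])  # pretend embedding of a dog photo

label, sims = zero_shot_classify(image_emb, label_embs, labels)
print(label)  # dog barking
```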

The researchers show that an adversary can deliberately perturb an image or sound to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal, meaning they can connect data across different modalities, and targeted, allowing the adversary to align any image or sound with any target of their choice.
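This summary does not spell out the optimization, so the following is only a sketch under stated assumptions: it treats the attack as projected gradient ascent (PGD) on the cosine similarity between the perturbed input's embedding and the target embedding, under an L-infinity budget `eps`. A random linear map `W` stands in for a real image encoder so the gradient can be written by hand. `illusion_attack`, `eps`, `alpha`, and `steps` are hypothetical names and values, not the authors' settings.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_grad(u: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Gradient of cos(u, t) with respect to u."""
    nu, nt = np.linalg.norm(u), np.linalg.norm(t)
    return t / (nu * nt) - (u @ t) * u / (nu**3 * nt)

def illusion_attack(x, W, target_emb, eps=0.05, alpha=0.005, steps=200):
    """Hypothetical PGD sketch: push embedding W @ x_adv toward target_emb."""
    x_adv = x.copy()
    best, best_sim = x.copy(), cosine(W @ x, target_emb)
    for _ in range(steps):
        # Chain rule through the (assumed) linear encoder u = W @ x.
        grad = W.T @ cosine_grad(W @ x_adv, target_emb)
        x_adv = x_adv + alpha * np.sign(grad)      # ascend cosine similarity
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep pixels in a valid range
        sim = cosine(W @ x_adv, target_emb)
        if sim > best_sim:                         # keep the best iterate seen
            best, best_sim = x_adv.copy(), sim
    return best, best_sim

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 256))   # hypothetical linear "image encoder"
x = rng.uniform(size=256)        # hypothetical clean image, pixels in [0, 1]
target = rng.normal(size=16)     # hypothetical embedding of the target (e.g. a sound)

before = cosine(W @ x, target)
x_adv, after = illusion_attack(x, W, target)
print(f"similarity to target: {before:.3f} -> {after:.3f}")
```

With a real encoder the gradient would come from automatic differentiation rather than `cosine_grad`; the sign step and the two clipping operations are the usual PGD recipe for keeping the change small and the input valid.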

The paper demonstrates these adversarial illusions using two existing multi-modal embeddings, ImageBind and AudioCLIP. The attacks are shown to mislead a variety of downstream tasks, including image generation, text generation, zero-shot classification, and audio retrieval, without any knowledge of the specific tasks.

The researchers also investigate the transferability of these illusions across different multi-modal embeddings. They develop a black-box version of their method and use it to demonstrate the first adversarial alignment attack on Amazon's proprietary Titan embedding.
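The summary does not say how the black-box variant works. One generic option, swapped in here purely for illustration, is to estimate the gradient from queries alone with antithetic Gaussian sampling (an NES-style finite-difference estimator) and then take the same projected sign steps as in a white-box attack. The `embed` function is a hypothetical stand-in for a query-only embedding API: the attack code never reads its weights. Crafting on an open model and relying on transferability, which the paper also studies, would be another route.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def estimate_grad(loss, x, rng, sigma=1e-3, n_pairs=50):
    """Antithetic Gaussian (NES-style) estimate of d loss / d x from 2 * n_pairs queries."""
    grad = np.zeros_like(x)
    for _ in range(n_pairs):
        u = rng.normal(size=x.shape)
        grad += (loss(x + sigma * u) - loss(x - sigma * u)) * u
    return grad / (2 * sigma * n_pairs)

def blackbox_illusion(embed, x, target_emb, rng, eps=0.05, alpha=0.005, steps=100):
    """Hypothetical query-only projected sign ascent on cos(embed(x_adv), target_emb)."""
    def loss(z):
        return cosine(embed(z), target_emb)

    x_adv, best, best_sim = x.copy(), x.copy(), loss(x)
    for _ in range(steps):
        g = estimate_grad(loss, x_adv, rng)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)  # L-inf budget
        x_adv = np.clip(x_adv, 0.0, 1.0)                                # valid pixel range
        sim = loss(x_adv)
        if sim > best_sim:
            best, best_sim = x_adv.copy(), sim
    return best, best_sim

# Hypothetical query-only API: the attack functions above never touch _W.
rng = np.random.default_rng(1)
_W = rng.normal(size=(16, 64))

def embed(z: np.ndarray) -> np.ndarray:
    return _W @ z

x = rng.uniform(size=64)        # hypothetical clean input in [0, 1]
target = rng.normal(size=16)    # hypothetical target embedding
before = cosine(embed(x), target)
x_adv, after = blackbox_illusion(embed, x, target, rng)
print(f"similarity to target: {before:.3f} -> {after:.3f}")
```

Each outer step costs `2 * n_pairs + 1` queries, so against a paid API the query budget, rather than local compute, would likely be the binding constraint.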

Finally, the paper analyzes potential countermeasures and evasion attacks. The researchers discuss the challenges in defending against these types of cross-modal, targeted attacks that are agnostic to downstream tasks and modalities.

Critical Analysis

The paper provides a comprehensive and technically detailed exploration of a new and concerning vulnerability in multi-modal AI systems. The adversarial illusion attacks highlight the potential risks of these systems, which can be undermined in broad and unpredictable ways by adversaries.

One limitation of the research is that it focuses on a relatively narrow set of multi-modal embeddings and downstream tasks. While the attacks are shown to be effective in these cases, it's unclear how they would scale to the full diversity of multi-modal AI applications and models.

Additionally, the paper doesn't delve deeply into the theoretical foundations of why these attacks are possible or the precise mechanisms by which the adversarial perturbations work. A more rigorous analysis of the underlying vulnerabilities could provide insight into more robust countermeasures.

The researchers do explore some potential defenses, but acknowledge the significant challenges in protecting against cross-modal, targeted attacks that are agnostic to downstream tasks. Further research is needed to develop comprehensive and practical mitigation strategies.

Overall, this paper makes an important contribution by shedding light on a new and concerning type of vulnerability in multi-modal AI. The findings should motivate further scrutiny of the security and robustness of these increasingly prevalent systems.

Conclusion

This paper uncovers a new type of attack called "adversarial illusions" that can compromise multi-modal AI systems. These attacks allow an adversary to deliberately perturb an image or sound to make its embedding align with an arbitrary target in another modality, like text or audio.

The cross-modal and targeted nature of these attacks means they can undermine a wide range of downstream tasks, even without knowledge of the specific applications. The researchers demonstrate the effectiveness of these attacks on various multi-modal AI models and tasks, highlighting the broader vulnerability of these systems.

The findings in this paper underscore the need for increased research and development of robust defenses against adversarial attacks on multi-modal AI. As these systems become more ubiquitous, ensuring their security and reliability will be crucial for their safe and ethical deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Adversarial Illusions in Multi-Modal Embeddings

Tingwei Zhang, Rishi Jha, Eugene Bagdasaryan, Vitaly Shmatikov

Multi-modal embeddings encode texts, images, thermal images, sounds, and videos into a single embedding space, aligning representations across different modalities (e.g., associate an image of a dog with a barking sound). In this paper, we show that multi-modal embeddings can be vulnerable to an attack we call adversarial illusions. Given an image or a sound, an adversary can perturb it to make its embedding close to an arbitrary, adversary-chosen input in another modality. These attacks are cross-modal and targeted: the adversary can align any image or sound with any target of his choice. Adversarial illusions exploit proximity in the embedding space and are thus agnostic to downstream tasks and modalities, enabling a wholesale compromise of current and future tasks, as well as modalities not available to the adversary. Using ImageBind and AudioCLIP embeddings, we demonstrate how adversarially aligned inputs, generated without knowledge of specific downstream tasks, mislead image generation, text generation, zero-shot classification, and audio retrieval. We investigate transferability of illusions across different embeddings and develop a black-box version of our method that we use to demonstrate the first adversarial alignment attack on Amazon's commercial, proprietary Titan embedding. Finally, we analyze countermeasures and evasion attacks.

6/18/2024

Adversarial Attacks to Multi-Modal Models

Zhihao Dou, Xin Hu, Haibo Yang, Zhuqing Liu, Minghong Fang

Multi-modal models have gained significant attention due to their powerful capabilities. These models effectively align embeddings across diverse data modalities, showcasing superior performance in downstream tasks compared to their unimodal counterparts. A recent study showed that an attacker can manipulate an image or audio file by altering it in such a way that its embedding matches that of an attacker-chosen targeted input, thereby deceiving downstream models. However, this method often underperforms due to inherent disparities in data from different modalities. In this paper, we introduce CrossFire, an innovative approach to attack multi-modal models. CrossFire begins by transforming the targeted input chosen by the attacker into a format that matches the modality of the original image or audio file. We then formulate our attack as an optimization problem, aiming to minimize the angular deviation between the embeddings of the transformed input and the modified image or audio file. Solving this problem determines the perturbations to be added to the original media. Our extensive experiments on six real-world benchmark datasets reveal that CrossFire can significantly manipulate downstream tasks, surpassing existing attacks. Additionally, we evaluate six defensive strategies against CrossFire, finding that current defenses are insufficient to counteract CrossFire.

9/12/2024

Unaligning Everything: Or Aligning Any Text to Any Image in Multimodal Models

Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu

Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.

7/2/2024

Adversarial Attacks on Multimodal Agents

Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of 16/256 on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack

6/19/2024