MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

2311.17600

Published 6/21/2024 by Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao

Abstract

The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench

Overview

The paper explores the security risks of Multimodal Large Language Models (MLLMs), which process both text and images.
The researchers introduce MM-SafetyBench, a framework for evaluating the safety of MLLMs against image-based attacks.
They find that current state-of-the-art MLLMs are vulnerable to such attacks, even when the underlying language models have been safety-aligned.
The paper proposes a prompting strategy to enhance the resilience of MLLMs against these types of malicious exploits.

Plain English Explanation

Large language models (LLMs) like GPT-3 have become powerful tools for a wide range of applications, from text generation to question answering. However, these models can also be vulnerable to various security threats, such as malicious prompts or unintended behaviors.

The researchers in this paper wanted to explore the security of a newer type of model called a Multimodal Large Language Model (MLLM), which can process both text and images. They observed that these models can be easily compromised when a query is paired with an image relevant to it, responding as if the text query itself were malicious.

To test this, the researchers created a comprehensive framework called MM-SafetyBench, which includes a dataset of 5,040 text-image pairs across 13 scenarios, designed to evaluate the safety of MLLMs. When they tested 12 state-of-the-art MLLMs, they found that the models were susceptible to these image-based attacks, even when the underlying language models had been "safety-aligned" to reduce harmful outputs.

In response, the researchers propose a simple prompting strategy that can help make MLLMs more resilient against these attacks. This work highlights the importance of continuing to develop robust safety measures for these powerful AI models, especially as they become more sophisticated and widely used.

Technical Explanation

The researchers first observed that while the security concerns surrounding Large Language Models (LLMs) have been extensively studied, the safety of Multimodal Large Language Models (MLLMs) remains relatively understudied. To address this gap, they introduced a comprehensive evaluation framework called MM-SafetyBench.

MM-SafetyBench is built on a dataset spanning 13 scenarios, for a total of 5,040 text-image pairs. These pairs were designed to test whether MLLMs could be easily compromised by query-relevant images, as if the text prompt itself were malicious.

The researchers then evaluated 12 state-of-the-art MLLMs using MM-SafetyBench. Their analysis revealed that these models were susceptible to the image-based manipulations, even when the underlying LLMs had been safety-aligned.
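
To make the evaluation described above more concrete, here is a minimal Python sketch of how judged model responses could be aggregated into a per-scenario rate of unsafe outputs. The record layout (`JudgedResponse`), the function name, and the scenario labels are hypothetical and chosen for illustration; the summary does not state which metric or judging procedure the paper uses, so treat this as one plausible way to tabulate results rather than the authors' method.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class JudgedResponse(NamedTuple):
    """One model response to one text-image pair (hypothetical record layout)."""
    scenario: str    # label of the benchmark scenario the pair belongs to
    is_unsafe: bool  # verdict from a human or automated judge (assumed given)


def unsafe_rate_by_scenario(records: Iterable[JudgedResponse]) -> dict[str, float]:
    """Return, for each scenario, the fraction of responses judged unsafe."""
    totals: dict[str, int] = defaultdict(int)
    unsafe: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.scenario] += 1
        if record.is_unsafe:
            unsafe[record.scenario] += 1
    # Every scenario in `totals` has at least one record, so no division by zero.
    return {name: unsafe[name] / totals[name] for name in totals}


if __name__ == "__main__":
    # Toy data with placeholder scenario names, not results from the paper.
    demo = [
        JudgedResponse("scenario_a", True),
        JudgedResponse("scenario_a", False),
        JudgedResponse("scenario_b", False),
    ]
    print(unsafe_rate_by_scenario(demo))
```

Computing the rate per scenario rather than as a single number keeps weak spots visible: a model could look safe on average while failing badly in one scenario.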

In response, the researchers proposed a simple prompting strategy that can enhance the resilience of MLLMs against these types of attacks. This work underscores the need for continued efforts to strengthen the safety measures of open-source MLLMs, as they become increasingly prevalent in real-world applications.
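
As a rough illustration of what a prompt-level defense can look like, the Python sketch below prepends a safety instruction to the user's text query before it is sent, along with the image, to an MLLM. The preamble wording and the `with_safety_prompt` helper are assumptions made for this example; the summary does not reproduce the paper's actual prompt, so this is not the authors' prompt.

```python
# Illustrative wording only; this is an assumption, not the prompt from the paper.
SAFETY_PREAMBLE = (
    "If the following request, taken together with the attached image, "
    "asks for harmful, illegal, or dangerous content, decline to answer. "
    "Otherwise, answer helpfully."
)


def with_safety_prompt(user_query: str, preamble: str = SAFETY_PREAMBLE) -> str:
    """Prepend a safety instruction to the text part of a text-image query."""
    query = user_query.strip()
    if not query:
        raise ValueError("user_query must not be empty")
    return f"{preamble}\n\n{query}"


if __name__ == "__main__":
    print(with_safety_prompt("Describe what is shown in the image."))
```

A defense of this kind needs no retraining, which is what makes it simple to apply. It also relies entirely on the model following the instruction, which is consistent with the point made later that prompting may not be a complete solution.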

Critical Analysis

While the researchers have made a valuable contribution by highlighting the security vulnerabilities of MLLMs, their study does have some limitations. For instance, the dataset they created, while comprehensive, may not capture the full range of potential attack vectors that could be used against these models in real-world scenarios.

Additionally, the prompting strategy proposed by the researchers, while effective, may not be a complete solution to the problem. There may be other approaches, such as improved model architectures or training procedures, that could further enhance the safety and robustness of MLLMs.

It's also important to note that the security of AI systems is a complex and constantly evolving challenge, and that continued research and collaboration between academia, industry, and policymakers will be crucial in addressing these issues effectively.

Conclusion

This paper sheds light on the security vulnerabilities of Multimodal Large Language Models (MLLMs), which combine text and image processing capabilities. The researchers have introduced a comprehensive evaluation framework, MM-SafetyBench, and have demonstrated that current state-of-the-art MLLMs are susceptible to image-based manipulations, even when the underlying language models have been safety-aligned.

In response, the researchers have proposed a simple prompting strategy to enhance the resilience of MLLMs against these types of attacks. This work underscores the critical need for ongoing efforts to strengthen the safety and security measures of these powerful AI systems, as they become increasingly integrated into a wide range of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Safety of Multimodal Large Language Models on Images and Texts

Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao

Attracted by the impressive power of Multimodal Large Language Models (MLLMs), the public is increasingly utilizing them to improve the efficiency of daily work. Nonetheless, the vulnerabilities of MLLMs to unsafe instructions bring huge safety risks when these models are deployed in real-world scenarios. In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text. We begin with introducing the overview of MLLMs on images and text and understanding of safety, which helps researchers know the detailed scope of our survey. Then, we review the evaluation datasets and metrics for measuring the safety of MLLMs. Next, we comprehensively present attack and defense techniques related to MLLMs' safety. Finally, we analyze several unsolved issues and discuss promising research directions. The latest papers are continually collected at https://github.com/isXinLiu/MLLM-Safety-Collection.

6/21/2024

cs.CV

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://llmbench.ai/safety.

6/26/2024

cs.CL

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang

Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.

6/14/2024

cs.CR cs.AI

All Languages Matter: On the Multilingual Safety of Large Language Models

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

6/21/2024

cs.CL cs.AI