SafetyBench: Evaluating the Safety of Large Language Models

2309.07045

Published 6/26/2024 by Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

cs.CL

Abstract

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://llmbench.ai/safety.

Overview

The rapid development of Large Language Models (LLMs) has led to increasing concerns about their safety.
Evaluating the safety of LLMs is essential for enabling their widespread applications, but the lack of comprehensive safety evaluation benchmarks poses a significant challenge.
This paper presents SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which includes 11,435 diverse multiple choice questions across 7 safety categories in both Chinese and English.
The authors' extensive tests on 25 popular Chinese and English LLMs reveal a substantial performance advantage for GPT-4, but also significant room for improving the safety of current LLMs.

Plain English Explanation

As Large Language Models (LLMs) become more advanced, there are growing concerns about their safety. It's crucial to be able to assess how safe these models are before they're used widely. However, there hasn't been a comprehensive way to evaluate the safety of LLMs.

The researchers created a new tool called SafetyBench to address this problem. SafetyBench includes over 11,000 multiple-choice questions covering 7 different areas of safety, in both Chinese and English. The researchers used this benchmark to test 25 popular LLMs, and found that GPT-4 performed the best. But the results also showed that there is still a lot of room for improving the safety of current LLMs.

This work is an important step in ensuring that as LLMs become more powerful and widely used, we can be confident that they are behaving safely and responsibly. By providing a standardized way to evaluate safety, SafetyBench can help researchers and developers make these models more reliable and trustworthy.

Technical Explanation

The authors present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs. SafetyBench comprises 11,435 diverse multiple choice questions spanning 7 distinct categories of safety concerns: offensiveness, unfairness and bias, physical health, mental health, illegal activities, ethics and morality, and privacy and property. Notably, the benchmark includes data in both Chinese and English, enabling the evaluation of LLMs in both languages.
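Because every item is a multiple choice question with a single correct option, scoring reduces to accuracy, which can be broken down by safety category. The following is a minimal sketch of that idea, not the authors' evaluation code: the `Question` fields, the `accuracy_by_category` helper, and the choice of a micro-averaged `"overall"` score are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One multiple choice item. Field names are hypothetical."""
    category: str              # safety category the item belongs to
    prompt: str                # question text
    options: tuple[str, ...]   # candidate answers
    answer: int                # index of the correct option


def accuracy_by_category(questions, predictions):
    """Return {category: fraction correct}, plus a micro-averaged 'overall'.

    `predictions` holds one predicted option index per question.
    """
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")

    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.category] += 1
        correct[question.category] += int(predicted == question.answer)

    scores = {cat: correct[cat] / total[cat] for cat in total}
    if total:
        # Assumption: 'overall' weights every question equally (micro-average).
        scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores
```

Reporting per-category scores alongside the overall figure makes it visible when a model is strong on average but weak in one particular area of safety.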

The authors conduct extensive tests on 25 popular Chinese and English LLMs, including GPT-4, in both zero-shot and few-shot settings. The results reveal a substantial performance advantage for GPT-4 over its counterparts, suggesting that it has a more robust understanding of safety-related concepts. However, the study also finds significant room for improvement in the safety of current LLMs.
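The difference between the zero-shot and few-shot settings lies in the prompt: a zero-shot prompt contains only the target question, while a few-shot prompt first shows a handful of solved examples. The sketch below illustrates this under assumptions; the prompt template and the `format_question` and `build_prompt` helpers are hypothetical and are not the wording used by SafetyBench.

```python
import string


def format_question(question, options, answer=None):
    """Render one question; include the answer only for solved examples."""
    letters = string.ascii_uppercase
    lines = [f"Question: {question}"]
    lines += [f"({letters[i]}) {option}" for i, option in enumerate(options)]
    if answer is None:
        lines.append("Answer:")  # left open for the model to complete
    else:
        lines.append(f"Answer: ({letters[answer]})")
    return "\n".join(lines)


def build_prompt(target, examples=()):
    """Build a prompt for `target`, given as (question, options).

    `examples` is a sequence of (question, options, answer_index) tuples.
    An empty sequence yields a zero-shot prompt; a non-empty one yields
    a few-shot prompt with the solved examples placed first.
    """
    blocks = [format_question(q, opts, ans) for q, opts, ans in examples]
    blocks.append(format_question(target[0], target[1]))
    return "\n\n".join(blocks)
```

Calling `build_prompt(target)` gives the zero-shot prompt, and passing `examples=[...]` gives the few-shot variant of the same question, so both settings can be compared on identical items.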

Additionally, the authors demonstrate that the measured safety understanding abilities in SafetyBench are correlated with the models' safety generation abilities, underscoring the benchmark's relevance and utility in assessing the real-world safety of LLMs.
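One way to check such a relationship is to collect, for each model, a multiple choice accuracy and a separately measured safety score for its free-form generations, and then compute a correlation coefficient across models. The sketch below uses the Pearson coefficient as an assumed choice of statistic; this summary does not say which measure the authors used, and the `pearson` helper is illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)
```

Here `xs` would hold one understanding score per model and `ys` the matching generation score; a value near 1 would indicate that models that do well on the multiple choice questions also tend to generate safer responses.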

Critical Analysis

The authors provide a comprehensive and well-designed benchmark for evaluating the safety of LLMs, which is a crucial step in enabling the broad deployment of these powerful models. However, the study acknowledges several limitations and areas for further research.

First, the authors note that while SafetyBench covers a diverse range of safety concerns, it may not capture all possible safety-related aspects. As the field of AI safety continues to evolve, the benchmark may need to be expanded and updated to stay relevant.

Additionally, the authors emphasize that the benchmark is primarily focused on language-based safety, and does not address the safety of multimodal LLMs that incorporate visual or other sensory inputs. Future work could explore the development of multimodal safety benchmarks to address this gap.

Finally, the study raises the important question of how to translate the safety understanding abilities measured by SafetyBench into real-world safety guarantees. While the observed correlation with safety generation abilities is promising, more research is needed to fully understand the relationship between these different safety aspects and to develop comprehensive safety evaluation suites for LLMs.

Conclusion

This paper presents a significant step forward in the field of LLM safety evaluation. By introducing SafetyBench, a comprehensive and multilingual benchmark for assessing the safety of LLMs, the authors have provided a valuable tool for researchers and developers to measure and improve the safety of these powerful models.

The study's findings highlight the current capabilities and limitations of popular LLMs, with GPT-4 showing a clear advantage in safety-related understanding. However, the results also underscore the need for continued research and development to enhance the overall safety of LLMs, which is crucial for enabling their broad and responsible deployment.

As the field of AI safety evolves, further advancements in safety evaluation methodologies and the incorporation of multilingual and multimodal perspectives will be essential to ensure the safe and ethical development of LLMs that can benefit society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, KaiKai Zhao, Kai Wang, Shiguo Lian

With the profound development of large language models (LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks: multiple-choice questions and question-answering, evaluating LLMs from the perspectives of risk content identification and the ability to refuse to answer risky questions, respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at https://github.com/UnicomAI/DataSet/tree/main/TestData/Safety.

6/18/2024

cs.CL cs.AI

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao

The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench

6/21/2024

cs.CV

S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Hui Xue, Wenhai Wang, Kui Ren, Jingyi Wang

Large Language Models have gained considerable attention for their revolutionary capabilities. However, there is also growing concern about their safety implications, making a comprehensive safety evaluation for LLMs urgently needed before model deployment. In this work, we propose S-Eval, a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark. At the core of S-Eval is a novel LLM-based automatic test prompt generation and selection framework, which trains an expert testing LLM M_t combined with a range of test selection strategies to automatically construct a high-quality test suite for the safety evaluation. The key to the automation of this process is a novel expert safety-critique LLM M_c able to quantify the riskiness score of an LLM's response, and additionally produce risk tags and explanations. Besides, the generation process is also guided by a carefully designed risk taxonomy with four different levels, covering comprehensive and multi-dimensional safety risks of concern. Based on these, we systematically construct a new and large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts, including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and accompanying safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks and models. S-Eval is extensively evaluated on 20 popular and representative LLMs. The results confirm that S-Eval can better reflect and inform the safety risks of LLMs compared to existing benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.

5/29/2024

cs.CR cs.CL

All Languages Matter: On the Multilingual Safety of Large Language Models

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

6/21/2024

cs.CL cs.AI