CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

2406.10311

Published 6/18/2024 by Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, KaiKai Zhao, Kai Wang, Shiguo Lian

cs.CL cs.AI

Abstract

With the profound development of large language models (LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks: multiple-choice questions and question-answering, evaluating LLMs from the perspectives of risk content identification and the ability to refuse to answer risky questions, respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at https://github.com/UnicomAI/DataSet/tree/main/TestData/Safety.

Overview

This paper introduces CHiSafetyBench, a new Chinese benchmark for evaluating the safety of large language models (LLMs).
The benchmark covers a hierarchical set of safety-critical tasks, including detecting unsafe content, guarding against social biases, and maintaining ethical and factual integrity.
The authors evaluate several state-of-the-art Chinese LLMs on CHiSafetyBench, providing insights into the current capabilities and limitations of these models in terms of safety and reliability.

Plain English Explanation

The paper presents a new evaluation tool called CHiSafetyBench that is designed to test the safety and reliability of large language models (LLMs) that are trained on Chinese data. LLMs are AI systems that can generate human-like text, but they can sometimes produce content that is unsafe, biased, or factually inaccurate.

CHiSafetyBench includes a range of tasks that are meant to assess different aspects of an LLM's safety, such as its ability to detect and avoid generating harmful or inappropriate content, its resistance to encoding social biases, and its capacity to maintain ethical and factual integrity. The authors tested several state-of-the-art Chinese LLMs using this benchmark and reported on the models' strengths and weaknesses in terms of safety.

This research is important because as LLMs become more powerful and widely used, it is crucial to have reliable ways to evaluate their safety and ensure they are not causing harm. CHiSafetyBench provides a comprehensive framework for assessing the safety of Chinese LLMs, which can help researchers and developers identify areas for improvement and work towards building more trustworthy and reliable AI systems.

Technical Explanation

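The paper introduces a new benchmark called CHiSafetyBench for evaluating the safety of large language models (LLMs) in Chinese contexts. Per the abstract, its dataset follows a hierarchical safety taxonomy of 5 risk areas and 31 categories and uses two task types: multiple-choice questions, which test risk content identification, and question answering, which tests whether a model refuses risky questions. Broadly, the benchmark targets three aspects of safety: 1) unsafe content detection, 2) social bias, and 3) ethical and factual integrity.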
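
The abstract describes two task types: multiple-choice questions (risk content identification) and question answering (refusing risky questions). A minimal sketch of how each might be scored is shown below. The record layouts (`McqItem`, `QaItem`) and the function names are hypothetical, not taken from the paper or its dataset, and the `refused` flag is assumed to come from a separate human or automatic judge.

```python
from dataclasses import dataclass


@dataclass
class McqItem:
    """Hypothetical multiple-choice record: gold option and the model's choice."""
    gold: str
    predicted: str


@dataclass
class QaItem:
    """Hypothetical risky-question record with a judged refusal flag."""
    question: str
    refused: bool  # assumption: supplied by a human or automatic judge


def mcq_accuracy(items: list[McqItem]) -> float:
    """Fraction of multiple-choice items where the model picked the gold option."""
    if not items:
        raise ValueError("no multiple-choice items to score")
    correct = sum(
        item.gold.strip().upper() == item.predicted.strip().upper()
        for item in items
    )
    return correct / len(items)


def refusal_rate(items: list[QaItem]) -> float:
    """Fraction of risky questions that the model refused to answer."""
    if not items:
        raise ValueError("no question-answering items to score")
    return sum(item.refused for item in items) / len(items)
```

Keeping the two metrics separate mirrors the abstract's split between identifying risky content and declining to answer risky questions; a model can do well on one and poorly on the other.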

The authors evaluated several state-of-the-art Chinese LLMs, including CPM-2, ERNIE-3.0, and PALM, on the CHiSafetyBench tasks. The results provide insights into the current capabilities and limitations of these models in terms of safety and reliability.
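
The abstract also states that the authors validate automatic evaluation as a substitute for human evaluation. The statistics they use are not given in this summary; one common way to quantify such agreement is the raw agreement rate together with Cohen's kappa, sketched here under the assumption that both the human and the automatic judge assign a binary label (for example, "refused" or "did not refuse") to each response.

```python
def agreement_rate(human: list[bool], auto: list[bool]) -> float:
    """Fraction of items on which the human and automatic labels match."""
    if len(human) != len(auto) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == a for h, a in zip(human, auto)) / len(human)


def cohens_kappa(human: list[bool], auto: list[bool]) -> float:
    """Cohen's kappa for two binary raters: agreement corrected for chance."""
    p_observed = agreement_rate(human, auto)
    n = len(human)
    p_human_true = sum(human) / n
    p_auto_true = sum(auto) / n
    # Chance agreement: both say True, or both say False, independently.
    p_expected = (
        p_human_true * p_auto_true
        + (1 - p_human_true) * (1 - p_auto_true)
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa is reported alongside raw agreement because a judge that always outputs the majority label can reach high raw agreement on an imbalanced test set while carrying little information.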

The benchmark includes a hierarchical set of tasks, where higher-level tasks build upon lower-level ones. For example, the unsafe content detection tasks start with simple binary classification (e.g., identifying hate speech), and progress to more nuanced forms of unsafe content (e.g., detecting misinformation, offensive language, and explicit sexual or violent content). The social bias tasks assess the models' ability to avoid encoding common biases, such as gender and racial biases. The ethical and factual integrity tasks evaluate the models' adherence to ethical principles and their factual accuracy.
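
To make the hierarchical organization concrete, the sketch below rolls per-item results up to category and risk-area scores. The two-level `TAXONOMY` is a placeholder built from the examples named in this summary; it is not the benchmark's actual list of 5 risk areas and 31 categories, and the `aggregate` function is an illustrative assumption rather than the authors' evaluation code.

```python
from collections import defaultdict

# Placeholder taxonomy (risk area -> categories); NOT the benchmark's real labels.
TAXONOMY = {
    "unsafe_content": ["hate_speech", "misinformation"],
    "social_bias": ["gender_bias", "racial_bias"],
}


def aggregate(
    results: list[tuple[str, bool]],
    taxonomy: dict[str, list[str]],
) -> tuple[dict[str, float], dict[str, float]]:
    """Turn (category, passed) pairs into per-category and per-area pass rates."""
    category_to_area = {
        category: area
        for area, categories in taxonomy.items()
        for category in categories
    }
    category_counts = defaultdict(lambda: [0, 0])  # [passed, total]
    area_counts = defaultdict(lambda: [0, 0])
    for category, passed in results:
        area = category_to_area[category]  # KeyError on an unknown category
        for counts in (category_counts[category], area_counts[area]):
            counts[0] += int(passed)
            counts[1] += 1

    def to_rates(table: dict[str, list[int]]) -> dict[str, float]:
        return {key: passed / total for key, (passed, total) in table.items()}

    return to_rates(category_counts), to_rates(area_counts)
```

Area scores here are averaged over items, so a category with more test items weighs more within its area; averaging over categories instead would be an equally defensible choice.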

The authors' evaluation of the Chinese LLMs on CHiSafetyBench reveals that while the models generally perform well on simpler safety tasks, they struggle with more complex and nuanced challenges. The results highlight the need for continued research and development to improve the safety and reliability of these powerful AI systems.

Critical Analysis

The authors of the paper have made a valuable contribution to the field of AI safety by developing CHiSafetyBench, a comprehensive benchmark for evaluating the safety of Chinese large language models (LLMs). The benchmark covers a wide range of safety-critical tasks, which is a significant strength of the research.

However, one potential limitation of the study is the relatively small number of LLMs evaluated. While the authors tested several state-of-the-art Chinese models, a more comprehensive evaluation with a broader range of models, including those from different research groups and companies, could provide a more complete picture of the current state of safety in Chinese LLMs.

Additionally, the paper does not delve deeply into the specific reasons why the tested models struggled with more complex safety tasks. Further analysis and discussion of the underlying factors that contribute to the models' performance, such as architectural limitations, training data biases, or algorithmic shortcomings, could help guide future research and development efforts to address these challenges.

It would also be valuable for the authors to compare the performance of the Chinese LLMs on CHiSafetyBench with results reported on other safety benchmarks, such as MM-SafetyBench, S-Eval, or ALERT. This could provide broader context for understanding the strengths and weaknesses of the Chinese LLMs relative to LLMs from other regions or language domains.

Overall, the CHiSafetyBench benchmark represents an important step forward in the effort to ensure the safety and reliability of large language models, particularly in the context of the Chinese language and culture. The insights gained from this research can inform the development of more robust and trustworthy AI systems that can be safely deployed in real-world applications.

Conclusion

The paper presents CHiSafetyBench, a new benchmark for evaluating the safety of large language models (LLMs) trained on Chinese data. The benchmark covers a hierarchical set of safety-critical tasks, including detecting unsafe content, guarding against social biases, and maintaining ethical and factual integrity.

The authors' evaluation of several state-of-the-art Chinese LLMs on CHiSafetyBench provides valuable insights into the current capabilities and limitations of these models in terms of safety and reliability. The results highlight the need for continued research and development to improve the safety and trustworthiness of these powerful AI systems.

Overall, the CHiSafetyBench benchmark represents an important contribution to the field of AI safety, particularly in the context of the Chinese language and culture. As LLMs become more widespread and influential, it is crucial to have reliable tools like CHiSafetyBench to assess their safety and ensure they are not causing harm.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://llmbench.ai/safety.

6/26/2024

cs.CL

A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.

5/28/2024

cs.CL

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao

The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench

6/21/2024

cs.CV

All Languages Matter: On the Multilingual Safety of Large Language Models

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

6/21/2024

cs.CL cs.AI