A Chinese Dataset for Evaluating the Safeguards in Large Language Models

2402.12193

Published 5/28/2024 by Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin

cs.CL

Abstract

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.

Overview

This paper examines the potential risks and safety issues associated with large language models (LLMs) in Chinese, in contrast to the prior focus on English.
The researchers introduce a new dataset for evaluating the safety of Chinese LLMs, and extend it to include additional scenarios to better identify false positives and false negatives in terms of risky prompt rejections.
They also present a set of detailed safety assessment criteria for different risk types, to enable both manual annotation and automatic evaluation of LLM response harmfulness.
Experiments on five LLMs show that region-specific risks are the most prevalent type of risk and the major issue for all the Chinese LLMs tested.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can generate human-like text. However, previous research has shown that these models can sometimes produce harmful or biased responses, exposing users to unexpected risks.

This paper aims to address the safety and risk issues with LLMs in the Chinese language, which has received less attention compared to English. The researchers created a new dataset to evaluate the safety of Chinese LLMs, and expanded it to include additional scenarios. This helps identify cases where the models incorrectly flag safe prompts as risky (false positives), as well as cases where they fail to detect harmful prompts (false negatives).

The researchers also developed a detailed set of criteria to assess different types of risks, such as generating biased or discriminatory content, or making exaggerated safety claims. These criteria can be used for both manual review and automated evaluation of LLM responses.

When they tested five LLMs, the researchers found that region-specific risks were the most prevalent type of risk and the major issue for all the Chinese LLMs they examined. This suggests that the safety and risk profiles of LLMs can vary significantly across different languages and cultural contexts.

Technical Explanation

The paper first introduces a dataset for evaluating the safety of Chinese LLMs, addressing the lack of research on non-English languages. This dataset includes a variety of prompts designed to elicit potentially harmful or biased responses from the models, covering different risk categories such as generating false medical advice or promoting extremist views.

To better assess the models' safety mechanisms, the researchers then extended the dataset to include two additional scenarios. The first scenario tests for false positives, where the model incorrectly flags a safe prompt as risky. The second scenario tests for false negatives, where the model fails to detect a harmful prompt.
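The two error types described above can be made concrete with a small sketch. It assumes each test item carries a gold label for whether the prompt is harmful and a flag for whether the model refused; the `Record` class, its field names, and the toy data are hypothetical and do not come from the paper. Counting every non-refusal of a harmful prompt as a false negative is a simplification, since a model can answer without producing harmful content.

```python
from dataclasses import dataclass


@dataclass
class Record:
    prompt_is_harmful: bool  # gold label for the prompt (assumed available)
    model_refused: bool      # whether the model declined to answer


def rejection_error_rates(records):
    """Return (false_positive_rate, false_negative_rate) for prompt rejection.

    False positive: a safe prompt that the model refused.
    False negative: a harmful prompt that the model did not refuse.
    """
    safe = [r for r in records if not r.prompt_is_harmful]
    harmful = [r for r in records if r.prompt_is_harmful]
    false_positives = sum(r.model_refused for r in safe)
    false_negatives = sum(not r.model_refused for r in harmful)
    fpr = false_positives / len(safe) if safe else 0.0
    fnr = false_negatives / len(harmful) if harmful else 0.0
    return fpr, fnr


# Toy data for illustration only; these are not results from the paper.
records = [
    Record(prompt_is_harmful=True, model_refused=True),
    Record(prompt_is_harmful=True, model_refused=False),
    Record(prompt_is_harmful=False, model_refused=False),
    Record(prompt_is_harmful=False, model_refused=True),
    Record(prompt_is_harmful=False, model_refused=False),
    Record(prompt_is_harmful=False, model_refused=False),
]

fpr, fnr = rejection_error_rates(records)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Reporting the two rates separately keeps over-refusal and under-refusal visible as distinct failure modes rather than folding them into a single accuracy number.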

The paper also presents a set of fine-grained safety assessment criteria, covering different risk types such as offensive language, hate speech, biased content, and factual inaccuracies. These criteria can be used for both manual annotation and automatic evaluation of LLM responses.
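One way to picture how per-risk-type criteria could drive both manual annotation and automatic evaluation is a checklist of yes/no questions for each risk type. The sketch below is an assumption about the general shape of such a scheme, not the paper's actual criteria: the `CRITERIA` table, the question wording, and the `misinformation` key are all hypothetical. The yes/no answers could come from a human annotator or from a judge model.

```python
# Hypothetical criteria: each risk type maps to a list of
# (question, answer_that_signals_harm) pairs. Wording is illustrative only.
CRITERIA = {
    "misinformation": [
        ("Does the response present unverified claims as fact?", True),
        ("Does the response encourage acting on those claims?", True),
    ],
    "region_specific": [
        ("Does the response contain content that is sensitive in the target region?", True),
    ],
}


def is_harmful(risk_type, answers):
    """Flag a response as harmful if any answer matches its unsafe value.

    `answers` holds one boolean per question, in the order listed in CRITERIA.
    """
    questions = CRITERIA[risk_type]
    if len(answers) != len(questions):
        raise ValueError("one answer per question is required")
    return any(
        answer == unsafe_answer
        for answer, (_question, unsafe_answer) in zip(answers, questions)
    )


# Example: an annotator answers "no" to both misinformation questions.
print(is_harmful("misinformation", [False, False]))
```

Splitting the judgment into narrow questions makes disagreements easier to trace to a specific criterion than a single harmful-or-not label would.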

The experiments were conducted on five LLMs, and the results showed that region-specific risks were the predominant type of risk, presenting the major issue for all the Chinese LLMs tested. This suggests that the safety and risk profiles of LLMs can vary greatly across languages and cultural contexts, highlighting the need for more comprehensive, multilingual safety evaluations.
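A comparison of risk types like the one described here rests on aggregating per-response harmfulness labels by model and risk type. The sketch below shows one way to do that aggregation; the function name, the model names, the risk-type labels, and the data are all hypothetical, and the numbers are toy values rather than the paper's results.

```python
from collections import defaultdict


def harmful_rate_by_risk(results):
    """Compute the share of harmful responses per (model, risk_type).

    `results` is an iterable of (model, risk_type, is_harmful) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [harmful, total]
    for model, risk_type, harmful in results:
        counts[(model, risk_type)][0] += int(harmful)
        counts[(model, risk_type)][1] += 1
    return {key: harmful / total for key, (harmful, total) in counts.items()}


# Toy labels for illustration only; not results from the paper.
results = [
    ("model_a", "region_specific", True),
    ("model_a", "region_specific", True),
    ("model_a", "region_specific", False),
    ("model_a", "misinformation", False),
    ("model_a", "misinformation", False),
    ("model_b", "region_specific", True),
    ("model_b", "region_specific", False),
]

rates = harmful_rate_by_risk(results)
for (model, risk_type), rate in sorted(rates.items()):
    print(f"{model:8s} {risk_type:16s} {rate:.2f}")
```

Ranking risk types by this rate within each model is what would surface a category that stands out across every model tested.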

Critical Analysis

The paper's focus on Chinese LLMs is a valuable contribution, as the majority of prior research has been limited to English. By introducing a new dataset and safety assessment framework for Chinese, the researchers have taken an important step towards addressing the language-specific challenges in this domain.

However, the paper does not provide much detail on the specific types of region-specific risks identified in the Chinese LLMs. It would be helpful to have a more nuanced understanding of the nature and severity of these risks, as well as their potential societal implications.

Additionally, the paper does not explore the potential reasons why Chinese LLMs appear to struggle more with region-specific risks compared to other risk categories. Further investigation into the underlying factors, such as training data biases or model architecture choices, could provide valuable insights for improving the safety and robustness of these systems.

It's also worth noting that the paper's findings are based on a limited set of five LLMs, and the researchers acknowledge that the dataset and safety criteria may not be comprehensive. Expanding the evaluation to a wider range of models and continuing to refine the assessment framework could lead to a more holistic understanding of the safety challenges in this domain.

Conclusion

This paper makes an important contribution to the growing body of research on the safety and risk issues associated with large language models. By focusing on the Chinese language, the researchers have highlighted the need for a more diverse and comprehensive approach to evaluating the potential harms of these powerful AI systems.

The introduction of a new dataset and fine-grained safety assessment criteria provides a valuable tool for both manual and automated evaluation of Chinese LLMs. The finding that region-specific risks are a prevalent issue for these models underscores the importance of considering language and cultural context when assessing the safety and robustness of LLMs.

As the use of LLMs continues to expand, it will be crucial for researchers, developers, and policymakers to work together to address the safety challenges identified in this and other studies. Ongoing efforts to mitigate linguistic discrimination and improve the safety of LLMs will be essential for ensuring these powerful AI systems are deployed responsibly and equitably across diverse linguistic and cultural contexts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models

Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, KaiKai Zhao, Kai Wang, Shiguo Lian

With the profound development of large language models (LLMs), their safety concerns have garnered increasing attention. However, there is a scarcity of Chinese safety benchmarks for LLMs, and the existing safety taxonomies are inadequate, lacking comprehensive safety detection capabilities in authentic Chinese scenarios. In this work, we introduce CHiSafetyBench, a dedicated safety benchmark for evaluating LLMs' capabilities in identifying risky content and refusing to answer risky questions in Chinese contexts. CHiSafetyBench incorporates a dataset that covers a hierarchical Chinese safety taxonomy consisting of 5 risk areas and 31 categories. This dataset comprises two types of tasks: multiple-choice questions and question-answering, evaluating LLMs from the perspectives of risk content identification and the ability to refuse to answer risky questions, respectively. Utilizing this benchmark, we validate the feasibility of automatic evaluation as a substitute for human evaluation and conduct comprehensive automatic safety assessments on mainstream Chinese LLMs. Our experiments reveal the varying performance of different models across various safety domains, indicating that all models possess considerable potential for improvement in Chinese safety capabilities. Our dataset is publicly available at https://github.com/UnicomAI/DataSet/tree/main/TestData/Safety.

6/18/2024

cs.CL cs.AI

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety

Paul Rottger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy

The last two years have seen a rapid growth in concerns around the safety of large language models (LLMs). Researchers and practitioners have met these concerns by introducing an abundance of new datasets for evaluating and improving LLM safety. However, much of this work has happened in parallel, and with very different goals in mind, ranging from the mitigation of near-term risks around bias and toxic content generation to the assessment of longer-term catastrophic risk potential. This makes it difficult for researchers and practitioners to find the most relevant datasets for a given use case, and to identify gaps in dataset coverage that future work may fill. To remedy these issues, we conduct a first systematic review of open datasets for evaluating and improving LLM safety. We review 102 datasets, which we identified through an iterative and community-driven process over the course of several months. We highlight patterns and trends, such as a trend towards fully synthetic datasets, as well as gaps in dataset coverage, such as a clear lack of non-English datasets. We also examine how LLM safety datasets are used in practice -- in LLM release publications and popular LLM benchmarks -- finding that current evaluation practices are highly idiosyncratic and make use of only a small fraction of available datasets. Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety, which we commit to updating continuously as the field of LLM safety develops.

4/9/2024

cs.CL cs.AI

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at https://github.com/thu-coai/SafetyBench. Submission entrance and leaderboard are available at https://llmbench.ai/safety.

6/26/2024

cs.CL

All Languages Matter: On the Multilingual Safety of Large Language Models

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

6/21/2024

cs.CL cs.AI