All Languages Matter: On the Multilingual Safety of Large Language Models

2310.00905

YC

0

Reddit

0

Published 6/21/2024 by Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

💬

Abstract

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

Create account to get full access

or

If you already have an account, we'll log you in

Overview

  • The paper focuses on the importance of developing and deploying large language models (LLMs) with a strong focus on safety.
  • Previous safety benchmarks have only considered safety in the majority language, such as English, but with the global deployment of LLMs, a multilingual safety benchmark is necessary.
  • The authors present the first multilingual safety benchmark, called XSafety, which covers 14 types of safety issues across 10 languages.
  • The authors use XSafety to study the multilingual safety of 4 widely-used LLMs, both closed-API and open-source models.
  • The results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, highlighting the need for improved safety alignment for non-English languages.
  • The authors propose simple and effective prompting methods to improve the multilingual safety of ChatGPT, reducing the ratio of unsafe responses for non-English queries from 19.1% to 9.7%.

Plain English Explanation

The paper focuses on an important issue in the development and use of large language models (LLMs): ensuring their safety across multiple languages. Previous safety tests for LLMs have only looked at how they perform in the dominant language, like English, but as these models are deployed globally, it's crucial to understand how they behave in a wider range of languages.

To address this, the researchers created a new multilingual safety benchmark called XSafety. This benchmark covers 14 different types of safety issues, like generating harmful or biased content, across 10 different languages.

Using this XSafety benchmark, the researchers tested 4 widely-used LLMs, both models that are closed-source (like ChatGPT) and open-source. The results showed that all the models performed significantly worse on safety metrics for non-English languages compared to English. This highlights the need for LLM developers to focus more on aligning the safety of these models across multiple languages, not just the dominant one.

To help address this, the researchers developed some simple prompting techniques that can improve the multilingual safety of ChatGPT. These prompting methods reduced the rate of unsafe responses for non-English queries from around 19% down to just 9.7%. This demonstrates that there are practical ways to enhance the multilingual safety of these powerful language models.

Overall, this research underscores the importance of thoroughly evaluating the safety of LLMs in a diverse range of languages, not just the most common ones. As these models become more widely used around the world, ensuring their multilingual safety will be crucial.

Technical Explanation

The paper presents the first multilingual safety benchmark for LLMs, called XSafety, which covers 14 types of safety issues across 10 languages spanning several language families. This is in contrast to previous safety benchmarks that have only focused on the majority language, such as English.

Using XSafety, the authors empirically study the multilingual safety of 4 widely-used LLMs, including both closed-API (e.g. ChatGPT) and open-source models. The experiments show that all LLMs produce significantly more unsafe responses for non-English queries compared to English ones. This indicates the necessity of developing safety alignment for non-English languages as these models are deployed globally.

To address this issue, the authors propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT. Their prompting techniques can evoke the model's safety knowledge and improve its cross-lingual generalization of safety alignment. Using these prompts, the authors are able to significantly reduce the ratio of unsafe responses for non-English queries from 19.1% to 9.7%.

The authors release their XSafety dataset to facilitate further research on multilingual safety for LLMs. This is an important step towards developing more robust and trustworthy language models that can be safely deployed globally.

Critical Analysis

The paper makes a valuable contribution by highlighting the need for multilingual safety evaluation of LLMs. The XSafety benchmark represents a significant step forward in this direction, providing a standardized way to assess safety across multiple languages.

However, the paper does not delve into the potential reasons why LLMs may perform worse on safety metrics for non-English languages. It would be helpful to understand if this is due to biases in the training data, limitations in the model architectures, or other factors. Exploring these underlying causes could inform more targeted solutions.

Additionally, the prompting techniques proposed in the paper, while effective, are relatively simple. There may be opportunities to develop more sophisticated prompting strategies or other approaches to further improve multilingual safety alignment. Exploring more advanced techniques could be a fruitful area for future research.

It's also worth noting that the XSafety benchmark, while comprehensive, may not capture the full breadth of safety concerns that can arise in real-world deployment scenarios. Expanding the benchmark to include a wider range of safety issues and use cases could strengthen the insights provided by this research.

Overall, this paper takes an important step forward in addressing the critical issue of multilingual safety for LLMs. Continued research and development in this area will be crucial as these powerful models are increasingly deployed around the world.

Conclusion

This paper highlights the importance of developing and deploying large language models (LLMs) with a strong focus on safety, particularly in a multilingual context. The authors present the first multilingual safety benchmark, called XSafety, which covers 14 types of safety issues across 10 languages.

Using XSafety, the researchers found that widely-used LLMs, including both closed-API and open-source models, perform significantly worse on safety metrics for non-English languages compared to English. This underscores the need for LLM developers to prioritize safety alignment across multiple languages, not just the dominant one.

To help address this issue, the authors proposed several simple and effective prompting methods that can improve the multilingual safety of ChatGPT. These techniques represent a practical way to enhance the safety of these powerful language models as they are deployed globally.

Overall, this research is an important step towards ensuring the safe and responsible development of LLMs that can be used effectively and equitably around the world. Continued work in this area will be crucial as these transformative technologies become more widely adopted.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

YC

0

Reddit

0

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at url{https://github.com/thu-coai/SafetyBench}{https://github.com/thu-coai/SafetyBench}. Submission entrance and leaderboard are available at url{https://llmbench.ai/safety}{https://llmbench.ai/safety}.

Read more

6/26/2024

💬

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao

YC

0

Reddit

0

The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench

Read more

6/21/2024

💬

A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin

YC

0

Reddit

0

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.

Read more

5/28/2024

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang

YC

0

Reddit

0

Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.

Read more

6/14/2024