SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety

2404.05399

YC

0

Reddit

0

Published 4/9/2024 by Paul Rottger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety

Abstract

The last two years have seen a rapid growth in concerns around the safety of large language models (LLMs). Researchers and practitioners have met these concerns by introducing an abundance of new datasets for evaluating and improving LLM safety. However, much of this work has happened in parallel, and with very different goals in mind, ranging from the mitigation of near-term risks around bias and toxic content generation to the assessment of longer-term catastrophic risk potential. This makes it difficult for researchers and practitioners to find the most relevant datasets for a given use case, and to identify gaps in dataset coverage that future work may fill. To remedy these issues, we conduct a first systematic review of open datasets for evaluating and improving LLM safety. We review 102 datasets, which we identified through an iterative and community-driven process over the course of several months. We highlight patterns and trends, such as a a trend towards fully synthetic datasets, as well as gaps in dataset coverage, such as a clear lack of non-English datasets. We also examine how LLM safety datasets are used in practice -- in LLM release publications and popular LLM benchmarks -- finding that current evaluation practices are highly idiosyncratic and make use of only a small fraction of available datasets. Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety, which we commit to updating continuously as the field of LLM safety develops.

Create account to get full access

or

If you already have an account, we'll log you in

Overview

  • This paper presents a systematic review of open datasets that can be used to evaluate and improve the safety of large language models (LLMs).
  • The authors introduce the concept of "SafetyPrompts" - a framework for assessing the safety and reliability of LLMs across various domains.
  • The paper explores a range of open datasets that can be used to benchmark the safety and ethical behavior of LLMs, including datasets related to developing safe and responsible LLMs, biomedical language understanding, introductory computer science concepts, and safety generalization challenges.

Plain English Explanation

This paper looks at different datasets that can be used to test the safety and reliability of large language models (LLMs) - the powerful AI systems that can generate human-like text. The authors introduce the idea of "SafetyPrompts" - a way to assess how well these LLMs behave when given certain types of prompts or instructions.

The paper explores a range of open datasets that researchers can use to see how LLMs respond to different situations. For example, there are datasets that test how well LLMs can understand and respond to biomedical information without making dangerous mistakes. There are also datasets that check if LLMs can grasp basic computer science concepts correctly. And there are datasets that look at how LLMs handle prompts related to sensitive or harmful topics.

By using these diverse datasets, researchers can get a better sense of the strengths and weaknesses of LLMs when it comes to safety and ethical behavior. This is important as these models become more powerful and widely used. The paper aims to provide a roadmap for researchers and developers to systematically evaluate and improve the safety of LLMs.

Technical Explanation

The paper presents a systematic review of open datasets that can be used to evaluate and improve the safety of large language models (LLMs). The authors introduce the concept of "SafetyPrompts" - a framework for assessing the safety and reliability of LLMs across various domains.

The authors conducted a thorough literature review to identify relevant open datasets. They categorized the datasets based on the type of safety and ethics evaluation they enable, such as:

  1. Developing safe and responsible LLMs: Datasets that assess an LLM's ability to generate safe and ethical responses to a wide range of prompts.

  2. Biomedical language understanding: Datasets that evaluate an LLM's capacity to understand and respond to biomedical information without making dangerous mistakes.

  3. Introductory computer science concepts: Datasets that assess an LLM's grasp of fundamental computer science principles and its ability to provide accurate and appropriate responses.

  4. Safety generalization challenges: Datasets that explore the safety and robustness of LLMs when faced with prompts or situations that test the models' ability to generalize their understanding and behavior.

The paper also discusses the potential applications of these datasets in auditing large language models for safety and ethical compliance.

Critical Analysis

The paper provides a comprehensive overview of the current landscape of open datasets for evaluating the safety and ethical behavior of large language models (LLMs). However, the authors acknowledge that the field is rapidly evolving, and new datasets and benchmarks are likely to emerge over time.

One potential limitation of the review is that it focuses primarily on publicly available datasets, which may not capture the full range of safety and ethics evaluation efforts being conducted by private companies and research institutions. Additionally, the authors note that the datasets may have biases or limitations that need to be carefully considered when using them for LLM evaluation.

The paper also highlights the ongoing challenge of ensuring the safety and reliability of LLMs as they become more sophisticated and widely deployed. The authors suggest that a systematic, multi-faceted approach to safety and ethics evaluation, as outlined in the "SafetyPrompts" framework, is crucial for addressing this challenge.

Conclusion

This paper presents a valuable resource for researchers and developers working on the safety and ethical development of large language models (LLMs). By systematically reviewing a range of open datasets, the authors have provided a roadmap for evaluating and improving the safety and reliability of these powerful AI systems.

The "SafetyPrompts" framework introduced in the paper offers a structured approach to assessing LLM safety across different domains, from biomedical language understanding to fundamental computer science concepts. As LLMs continue to advance and become more ubiquitous, the insights and recommendations provided in this paper will be crucial for ensuring that these technologies are developed and deployed in a responsible and trustworthy manner.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin

YC

0

Reddit

0

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk, presenting the major issue with all Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.

Read more

5/28/2024

Towards Safe Large Language Models for Medicine

Towards Safe Large Language Models for Medicine

Tessa Han, Aounon Kumar, Chirag Agarwal, Himabindu Lakkaraju

YC

0

Reddit

0

As large language models (LLMs) develop increasingly sophisticated capabilities and find applications in medical settings, it becomes important to assess their medical safety due to their far-reaching implications for personal and public health, patient safety, and human rights. However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association. We then leverage this understanding to introduce MedSafetyBench, the first benchmark dataset specifically designed to measure the medical safety of LLMs. We demonstrate the utility of MedSafetyBench by using it to evaluate and improve the medical safety of LLMs. Our results show that publicly-available medical LLMs do not meet standards of medical safety and that fine-tuning them using MedSafetyBench improves their medical safety. By introducing this new benchmark dataset, our work enables a systematic study of the state of medical safety in LLMs and motivates future work in this area, thereby mitigating the safety risks of LLMs in medicine.

Read more

6/14/2024

💬

All Languages Matter: On the Multilingual Safety of Large Language Models

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

YC

0

Reddit

0

Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.

Read more

6/21/2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li

YC

0

Reddit

0

When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.

Read more

6/26/2024