S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

2405.14191

Published 5/29/2024 by Xiaohan Yuan, Jinfeng Li, Dongxia Wang, Yuefeng Chen, Xiaofeng Mao, Longtao Huang, Hui Xue, Wenhai Wang, Kui Ren, Jingyi Wang

cs.CR cs.CL

S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

Abstract

Large Language Models have gained considerable attention for their revolutionary capabilities. However, there is also growing concern on their safety implications, making a comprehensive safety evaluation for LLMs urgently needed before model deployment. In this work, we propose S-Eval, a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark. At the core of S-Eval is a novel LLM-based automatic test prompt generation and selection framework, which trains an expert testing LLM Mt combined with a range of test selection strategies to automatically construct a high-quality test suite for the safety evaluation. The key to the automation of this process is a novel expert safety-critique LLM Mc able to quantify the riskiness score of an LLM's response, and additionally produce risk tags and explanations. Besides, the generation process is also guided by a carefully designed risk taxonomy with four different levels, covering comprehensive and multi-dimensional safety risks of concern. Based on these, we systematically construct a new and large-scale safety evaluation benchmark for LLMs consisting of 220,000 evaluation prompts, including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks against LLMs. Moreover, considering the rapid evolution of LLMs and accompanied safety threats, S-Eval can be flexibly configured and adapted to include new risks, attacks and models. S-Eval is extensively evaluated on 20 popular and representative LLMs. The results confirm that S-Eval can better reflect and inform the safety risks of LLMs compared to existing benchmarks. We also explore the impacts of parameter scales, language environments, and decoding parameters on the evaluation, providing a systematic methodology for evaluating the safety of LLMs.

Create account to get full access

Overview

This paper proposes a novel framework called S-Eval for automatically and adaptively generating tests to benchmark the safety and reliability of large language models (LLMs).
S-Eval aims to provide a comprehensive and systematic evaluation of LLMs' safety, including their ability to detect and avoid generating harmful or undesirable content.
The framework leverages techniques like adversarial attacks, prompt engineering, and meta-learning to automatically create diverse test cases that can adapt to the model's capabilities over time.

Plain English Explanation

The paper introduces S-Eval, a new system designed to thoroughly test the safety and reliability of large language models (LLMs) – the powerful AI systems that can generate human-like text. LLMs have become increasingly capable, but there are concerns about their potential to produce harmful or undesirable content.

S-Eval addresses this challenge by automatically generating a wide variety of test cases to evaluate an LLM's safety. It uses techniques like adversarial attacks (where the system tries to trick the model) and prompt engineering (carefully crafting input text) to create diverse test scenarios. Importantly, S-Eval can also adapt these tests over time as the LLM's capabilities change.

The goal is to provide a comprehensive and systematic way to benchmark the safety of LLMs, helping to ensure they can be deployed responsibly and with appropriate safeguards in place. This work builds on prior efforts like CyberSecEval, Online Safety Analysis, and ALERT to create more robust testing frameworks for these powerful AI systems.

Technical Explanation

S-Eval is designed to automatically and adaptively generate a diverse set of test cases to evaluate the safety of large language models (LLMs). The framework consists of several key components:

Prompt Engineering: S-Eval leverages advanced prompt engineering techniques to create carefully crafted input prompts that can test the LLM's ability to avoid generating harmful or undesirable content.
Adversarial Attacks: The system also incorporates adversarial attacks, where it systematically modifies the input prompts to try and trick the LLM into producing unsafe outputs.
Adaptive Test Generation: S-Eval uses meta-learning approaches to continuously adapt and improve the test cases over time, allowing it to keep pace with the evolving capabilities of the LLM being evaluated.

The paper presents the overall S-Eval architecture and describes a series of experiments demonstrating its effectiveness in benchmarking the safety of various LLM models. The results show that S-Eval can generate diverse and challenging test cases that expose vulnerabilities in the LLMs, helping to identify potential safety issues.

This work builds on and complements other recent efforts in this area, such as S3Eval, which focuses on creating a comprehensive evaluation suite for LLMs, and the framework for real-time safeguarding of text generation, which explores techniques for monitoring and controlling the safety of LLM outputs.

Critical Analysis

The S-Eval framework represents a significant advancement in the field of LLM safety evaluation. By automating the generation of diverse and adaptive test cases, it provides a more systematic and scalable approach to benchmarking these models compared to manual testing.

However, the paper acknowledges some limitations and areas for future work. For example, the current version of S-Eval focuses primarily on textual outputs, and extending the framework to handle multimodal LLMs (e.g., those that can generate images or videos) would be an important next step.

Additionally, while S-Eval can adapt to changes in the LLM's capabilities over time, the paper does not explore how the framework might need to evolve to keep pace with more fundamental shifts in language model architectures or training approaches. Continued research and refinement of the S-Eval methodology will be crucial as the field of large language models rapidly progresses.

Conclusion

The S-Eval framework proposed in this paper represents a significant contribution to the challenge of safely and reliably deploying large language models. By automating the generation of diverse and adaptive test cases, it provides a more comprehensive and systematic approach to benchmarking the safety of these powerful AI systems.

As LLMs become increasingly capable and widespread, the need for robust safety evaluation frameworks like S-Eval will only grow. This work, along with related efforts in CyberSecEval, Online Safety Analysis, and ALERT, represents an important step towards ensuring the responsible development and deployment of these transformative technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at url{https://github.com/thu-coai/SafetyBench}{https://github.com/thu-coai/SafetyBench}. Submission entrance and leaderboard are available at url{https://llmbench.ai/safety}{https://llmbench.ai/safety}.

6/26/2024

cs.CL

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, Joshua Saxe

Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our results show that conditioning away risk of attack remains an unsolved problem; for example, all tested models showed between 26% and 41% successful prompt injection tests. We further introduce the safety-utility tradeoff: conditioning an LLM to reject unsafe prompts can cause the LLM to falsely reject answering benign prompts, which lowers utility. We propose quantifying this tradeoff using False Refusal Rate (FRR). As an illustration, we introduce a novel test set to quantify FRR for cyberattack helpfulness risk. We find many LLMs able to successfully comply with borderline benign requests while still rejecting most unsafe requests. Finally, we quantify the utility of LLMs for automating a core cybersecurity task, that of exploiting software vulnerabilities. This is important because the offensive capabilities of LLMs are of intense interest; we quantify this by creating novel test sets for four representative problems. We find that models with coding capabilities perform better than those without, but that further work is needed for LLMs to become proficient at exploit generation. Our code is open source and can be used to evaluate other LLMs.

4/23/2024

cs.CR cs.LG

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang

Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.

6/14/2024

cs.CR cs.AI

💬

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, Yu Qiao

The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench

6/21/2024

cs.CV