DART: Deep Adversarial Automated Red Teaming for LLM Safety

Read original: arXiv:2407.03876 - Published 7/8/2024 by Bojian Jiang, Yi Jing, Tianhao Shen, Qing Yang, Deyi Xiong

🤿

Overview

Manual red teaming, a method to identify vulnerabilities in large language models (LLMs), is costly and unscalable.
Automated red teaming uses a Red LLM to automatically generate adversarial prompts to the Target LLM, offering a scalable way for safety vulnerability detection.
The difficulty of building a powerful automated Red LLM lies in the fact that the safety vulnerabilities of the Target LLM are dynamically changing with the evolution of the Target LLM.
To mitigate this issue, a Deep Adversarial Automated Red Teaming (DART) framework is proposed.

Plain English Explanation

In the context of large language models (LLMs), manual red teaming is a method used to identify vulnerabilities. However, this process can be costly and difficult to scale. To address this, automated red teaming uses a "Red LLM" to automatically generate adversarial prompts to test the "Target LLM" for safety vulnerabilities. This provides a more scalable approach.

The challenge is that the safety vulnerabilities of the Target LLM are constantly changing as the model evolves. To solve this, the researchers propose a framework called Deep Adversarial Automated Red Teaming (DART), where the Red LLM and Target LLM interact with each other in an iterative manner.

In each iteration, the Red LLM not only considers the Target LLM's responses, but also adjusts its attacking strategies to generate as many successful attacks as possible. At the same time, the Target LLM tries to enhance its own safety through an active learning mechanism, exploring ways to address its dynamically changing vulnerabilities.

Technical Explanation

The Deep Adversarial Automated Red Teaming (DART) framework consists of a Red LLM and a Target LLM that interact with each other in an iterative manner to identify and mitigate safety vulnerabilities.

In each iteration:

The Red LLM generates adversarial prompts based on the Target LLM's responses, aiming to find as many successful attacks as possible.
The Red LLM also monitors the global diversity of the generated attacks across multiple iterations, and adjusts its attacking directions accordingly to explore a wider range of vulnerabilities.
Simultaneously, the Target LLM employs an active learning-based data selection mechanism to enhance its own safety, proactively addressing its evolving vulnerabilities.

Experimental results show that the DART framework significantly reduces the safety risk of the Target LLM. Compared to a standard instruction-tuned Target LLM, DART eliminated 53.4% of the violation risks on the Anthropic Harmless dataset.

Critical Analysis

The DART framework addresses an important challenge in automated red teaming for LLMs - the dynamically changing nature of safety vulnerabilities as models evolve. By enabling the Red LLM and Target LLM to interact deeply and iteratively, the approach aims to keep pace with these shifting vulnerabilities.

However, the paper does not provide details on the specific architectures or training approaches used for the Red LLM and Target LLM. Additionally, the experiments are limited to a single dataset, and the evaluation metrics could be expanded to provide a more comprehensive assessment of the framework's effectiveness.

Further research could explore ways to make the DART framework more generalizable, such as investigating its performance on a wider range of LLM architectures and safety benchmarks. Integrating the framework with other safety techniques, such as prompt engineering or game-theoretic approaches, could also be a promising direction.

Conclusion

The Deep Adversarial Automated Red Teaming (DART) framework offers a scalable solution to the challenge of identifying and mitigating safety vulnerabilities in large language models. By enabling dynamic interactions between a Red LLM and a Target LLM, the approach can keep pace with the evolving nature of these vulnerabilities.

While further research is needed to expand the framework's capabilities and generalizability, DART represents an important step towards more robust and reliable LLM safety practices. As LLMs continue to grow in power and influence, tools like DART will be crucial for ensuring their safe and responsible development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🤿

DART: Deep Adversarial Automated Red Teaming for LLM Safety

Bojian Jiang, Yi Jing, Tianhao Shen, Qing Yang, Deyi Xiong

Manual Red teaming is a commonly-used method to identify vulnerabilities in large language models (LLMs), which, is costly and unscalable. In contrast, automated red teaming uses a Red LLM to automatically generate adversarial prompts to the Target LLM, offering a scalable way for safety vulnerability detection. However, the difficulty of building a powerful automated Red LLM lies in the fact that the safety vulnerabilities of the Target LLM are dynamically changing with the evolution of the Target LLM. To mitigate this issue, we propose a Deep Adversarial Automated Red Teaming (DART) framework in which the Red LLM and Target LLM are deeply and dynamically interacting with each other in an iterative manner. In each iteration, in order to generate successful attacks as many as possible, the Red LLM not only takes into account the responses from the Target LLM, but also adversarially adjust its attacking directions by monitoring the global diversity of generated attacks across multiple iterations. Simultaneously, to explore dynamically changing safety vulnerabilities of the Target LLM, we allow the Target LLM to enhance its safety via an active learning based data selection mechanism. Experimential results demonstrate that DART significantly reduces the safety risk of the target LLM. For human evaluation on Anthropic Harmless dataset, compared to the instruction-tuning target LLM, DART eliminates the violation risks by 53.4%. We will release the datasets and codes of DART soon.

7/8/2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li

When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.

6/26/2024

Learning diverse attacks on large language models for robust red-teaming and safety tuning

Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

5/30/2024

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found in https://github.com/GuanlinLee/ART.

6/18/2024