Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game

2404.02532

Published 4/4/2024 by Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li

Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game

Abstract

With the enhanced performance of large models on natural language processing tasks, potential moral and ethical issues of large models arise. There exist malicious attackers who induce large models to jailbreak and generate information containing illegal, privacy-invasive information through techniques such as prompt engineering. As a result, large models counter malicious attackers' attacks using techniques such as safety alignment. However, the strong defense mechanism of the large model through rejection replies is easily identified by attackers and used to strengthen attackers' capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both safely reply to the attacker and hide the defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, playing different roles to be responsible for attack, disguise, safety evaluation, and disguise evaluation tasks. After that, we design attack and disguise game algorithms to optimize the game strategies of the attacker and the disguiser and use the curriculum learning process to strengthen the capabilities of the agents. The experiments verify that the method in this paper is more effective in strengthening the model's ability to disguise the defense intent compared with other methods. Moreover, our approach can adapt any black-box large model to assist the model in defense and does not suffer from model version iterations.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper explores a novel "attacker-disguiser" game to help large language models (LLMs) defend against refusal responses, where the model refuses to engage with certain prompts.
The researchers train an "attacker" agent to find prompts that trigger refusal responses, and a "disguiser" agent to learn how to rephrase those prompts in a way that avoids triggering the refusal.
The goal is to improve the robustness of LLMs by enabling them to handle a wider range of requests without defaulting to refusal.

Plain English Explanation

Large language models (LLMs) like ChatGPT are powerful AI systems that can engage in human-like conversations and complete a variety of tasks. However, these models sometimes refuse to respond to certain prompts, a behavior known as a "refusal response." This could happen if the prompt asks the model to do something unethical or dangerous.

The researchers in this paper wanted to find a way to reduce these refusal responses and make the LLMs more robust. They came up with the idea of an "attacker-disguiser" game, where one agent (the "attacker") tries to find prompts that trigger refusals, and another agent (the "disguiser") tries to rephrase those prompts in a way that avoids triggering the refusal.

The key insight is that by training the disguiser agent to rewrite problematic prompts, the LLM can learn to handle a wider range of requests without defaulting to a refusal. This could make the LLM more useful and reliable in real-world applications.

Imagine you're trying to get an AI assistant to help you with a task, but it keeps refusing because it thinks the task is inappropriate. The attacker-disguiser game could help the AI learn how to rephrase your request in a way that it finds acceptable, allowing you to get the assistance you need.

Technical Explanation

The paper proposes a multi-agent framework consisting of an "attacker" agent and a "disguiser" agent. The attacker agent is trained to find prompts that trigger refusal responses from the target LLM. The disguiser agent is then trained to rephrase those problematic prompts in a way that avoids triggering the refusal.

The training process involves the attacker and disguiser agents playing a repeated game, where the attacker tries to find new prompts that cause refusals, and the disguiser tries to modify those prompts to bypass the refusal. The LLM's response to each prompt is used as feedback to update the agents' policies.

The researchers evaluate their approach on two large language models, GPT-3 and InstructGPT, and show that the disguiser agent is able to successfully rephrase a significant portion of the problematic prompts identified by the attacker. This suggests that the attacker-disguiser game can be an effective way to improve the robustness of LLMs and reduce their tendency to refuse certain requests.

Critical Analysis

The paper presents a novel and intriguing approach to addressing the issue of refusal responses in large language models. The attacker-disguiser game is a clever way to incentivize the model to learn how to handle a wider range of prompts without defaulting to a refusal.

However, the paper does not explore the potential limitations or unintended consequences of this approach. For example, it's possible that the disguiser agent could learn to rephrase prompts in a way that misleads or manipulates the LLM, potentially leading to undesirable outcomes. Additionally, the paper does not address the ethical considerations of training an agent to find ways to bypass the LLM's built-in safeguards.

Further research is needed to fully understand the implications of this approach and ensure that it is implemented in a responsible and ethical manner. It will also be important to investigate how well the disguiser agent's strategies generalize to prompts beyond those identified by the attacker, and to explore the potential for adversarial attacks on the disguiser itself.

Conclusion

The paper presents a novel "attacker-disguiser" game as a way to improve the robustness of large language models and reduce their tendency to refuse certain requests. By training an attacker agent to find problematic prompts and a disguiser agent to rephrase those prompts in a way that avoids triggering refusals, the researchers show that LLMs can be made more capable of handling a wider range of tasks and requests.

This work has the potential to significantly enhance the usefulness and reliability of LLMs in real-world applications, but it also raises important ethical considerations that will need to be carefully addressed. As the field of AI continues to advance, it will be crucial to develop techniques like this in a responsible and thoughtful manner, with a focus on ensuring the technology is used to benefit society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Yukai Zhou, Wenjie Wang

Ensuring the safety alignment of Large Language Models (LLMs) is crucial to generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs are vulnerable to jailbreaking attacks, where carefully crafted prompts elicit them to produce toxic content. One category of jailbreak attacks is reformulating the task as adversarial attacks by eliciting the LLM to generate an affirmative response. However, the typical attack in this category GCG has very limited attack success rate. In this study, to better study the jailbreak attack, we introduce the DSN (Don't Say No) attack, which prompts LLMs to not only generate affirmative responses but also novelly enhance the objective to suppress refusals. In addition, another challenge lies in jailbreak attacks is the evaluation, as it is difficult to directly and accurately assess the harmfulness of the attack. The existing evaluation such as refusal keyword matching has its own limitation as it reveals numerous false positive and false negative instances. To overcome this challenge, we propose an ensemble evaluation pipeline incorporating Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators. Extensive experiments demonstrate the potency of the DSN and the effectiveness of ensemble evaluation compared to baseline methods.

4/26/2024

cs.CL

➖

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, Duen Horng Chau

Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent LLMs against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2. The code is publicly available at https://github.com/poloclub/llm-self-defense

5/3/2024

cs.CL cs.AI

👀

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek

Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Bard (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then followed a role-play style to elicit prohibited responses. By making use of personas, we show that such responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Bard. We also introduce several ways of activating such adversarial personas, which show that both chatbots are vulnerable to this kind of attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.

5/6/2024

cs.CR cs.LG

💬

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, Daniel Garcia-Romero, Sundararajan Srinivasan, Kyu J Han, Katrin Kirchhoff

Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remains largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10% respectively when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly.

5/15/2024

cs.CL cs.SD eess.AS