Instruction Backdoor Attacks Against Customized LLMs

2402.09179

Published 5/29/2024 by Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang

🔮

Abstract

The increasing demand for customized Large Language Models (LLMs) has led to the development of solutions like GPTs. These solutions facilitate tailored LLM creation via natural language prompts without coding. However, the trustworthiness of third-party custom versions of LLMs remains an essential concern. In this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs (e.g., GPTs). Specifically, these attacks embed the backdoor into the custom version of LLMs by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. We stress that our attacks do not require fine-tuning or any modification to the backend LLMs, adhering strictly to GPTs development guidelines. We conduct extensive experiments on 6 prominent LLMs and 5 benchmark text classification datasets. The results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. Additionally, we propose two defense strategies and demonstrate their effectiveness in reducing such attacks. Our findings highlight the vulnerability and the potential risks of LLM customization such as GPTs.

Create account to get full access

Overview

This paper explores the development of customized Large Language Models (LLMs) using natural language prompts, and the potential security risks associated with these models.
The researchers propose a new type of attack, called an "instruction backdoor attack," which can be embedded into custom LLMs to produce desired results when specific triggers are present in the input.
The attacks are designed to work within the standard guidelines for LLM customization, without requiring any modifications to the backend models.
The authors conduct extensive experiments on several prominent LLMs and benchmark datasets, and also propose defensive strategies to mitigate such attacks.

Plain English Explanation

Large Language Models (LLMs) like GPT are powerful AI systems that can generate human-like text. As the demand for customized LLMs grows, solutions have been developed to allow users to create tailored versions of these models using natural language prompts, without the need for coding.

However, the researchers in this paper are concerned about the trustworthiness of these third-party custom LLMs. They propose a new type of attack, called an "instruction backdoor attack," which could be embedded into custom LLMs. These attacks work by adding special "triggers" to the prompts used to create the custom model. When the custom model is later used with input containing these triggers, it will produce the attacker's desired result, rather than the intended output.

The attacks can happen at different levels - word-level, syntax-level, and semantic-level - each using more sophisticated triggers that are harder to detect. Importantly, the researchers emphasize that these attacks do not require any changes to the underlying LLM itself, only to the prompts used to customize it.

Through their experiments on several prominent LLMs and benchmark datasets, the researchers demonstrate that these instruction backdoor attacks can be highly effective, while still maintaining the overall utility of the customized model. They also propose two defense strategies that could help reduce the impact of such attacks.

The key takeaway is that the growing use of customized LLMs, while convenient, may come with hidden security risks that need to be carefully considered. The researchers hope their work will raise awareness of these potential vulnerabilities and inspire further research into ensuring the trustworthiness of customized AI models.

Technical Explanation

The researchers start by highlighting the increasing demand for customized Large Language Models (LLMs), such as GPT, which can be tailored to specific use cases through natural language prompts. This process of "instruction tuning" allows users to create customized LLMs without the need for extensive coding or fine-tuning.

However, the researchers argue that the trustworthiness of these third-party custom LLMs is a crucial concern that has not been adequately addressed. To address this, they propose a new type of attack, called an "instruction backdoor attack," which can be embedded into customized LLMs.

These attacks work by designing prompts with backdoor instructions, which cause the customized LLM to output the attacker's desired result when the input contains pre-defined triggers. The researchers present three levels of attacks: word-level, syntax-level, and semantic-level, each with increasingly stealthy triggers.

Importantly, the researchers emphasize that their attacks do not require any modification to the backend LLMs, adhering strictly to the standard guidelines for LLM customization. They conduct extensive experiments on 6 prominent LLMs and 5 benchmark text classification datasets, demonstrating the effectiveness of their instruction backdoor attacks without compromising the overall utility of the customized models.

In addition to the attacks, the researchers also propose two defense strategies and evaluate their effectiveness in reducing the impact of such attacks. These defenses include techniques like vocabulary attacks and backdoor removal methods for generative LLMs.

Critical Analysis

The researchers have identified a significant and timely security concern regarding the customization of Large Language Models (LLMs) through natural language prompts. Their proposed "instruction backdoor attacks" highlight a previously unexplored vulnerability in this emerging field of AI development.

One potential limitation of the study is the reliance on a limited set of benchmark datasets and LLM architectures. While the researchers have demonstrated the effectiveness of their attacks across several prominent models, it would be valuable to expand the analysis to a wider range of datasets and LLM configurations to further assess the generalizability of the findings.

Additionally, the researchers' proposed defense strategies, while promising, may not be sufficient to fully mitigate the risk of these attacks in real-world scenarios. Further research into more robust and comprehensive defense mechanisms would be valuable to ensure the trustworthiness of customized LLMs.

Overall, this paper represents an important contribution to the field of AI security, bringing attention to a critical vulnerability that has not been widely explored. The findings should prompt further investigation and the development of more secure approaches to LLM customization, to ensure the responsible and trustworthy deployment of these powerful AI systems.

Conclusion

This paper highlights the significant security risks associated with the growing trend of customizing Large Language Models (LLMs) through natural language prompts. The researchers propose a novel "instruction backdoor attack" that can be embedded into custom LLMs, allowing attackers to manipulate the model's outputs without modifying the underlying model itself.

The researchers' extensive experiments demonstrate the effectiveness of these attacks across a range of prominent LLMs and benchmark datasets, while also showcasing two defense strategies that can help mitigate such vulnerabilities. These findings underscore the critical need for further research and development of robust security measures to ensure the trustworthiness of customized AI models, as their widespread adoption continues to grow.

By bringing attention to this issue, the researchers hope to inspire the AI community to prioritize the security and reliability of customized LLMs, ensuring that the convenience and flexibility of these models are not undermined by the potential for malicious exploitation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

Exploring Backdoor Attacks against Large Language Model-based Decision Making

Ruochen Jiao, Shaoyuan Xie, Justin Yue, Takami Sato, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu

Large Language Models (LLMs) have shown significant promise in decision-making tasks when fine-tuned on specific applications, leveraging their inherent common sense and reasoning abilities learned from vast amounts of data. However, these systems are exposed to substantial safety and security risks during the fine-tuning phase. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-enabled Decision-making systems (BALD), systematically exploring how such attacks can be introduced during the fine-tuning phase across various channels. Specifically, we propose three attack mechanisms and corresponding backdoor optimization methods to attack different components in the LLM-based decision-making pipeline: word injection, scenario manipulation, and knowledge injection. Word injection embeds trigger words directly into the query prompt. Scenario manipulation occurs in the physical environment, where a high-level backdoor semantic scenario triggers the attack. Knowledge injection conducts backdoor attacks on retrieval augmented generation (RAG)-based LLM systems, strategically injecting word triggers into poisoned knowledge while ensuring the information remains factually accurate for stealthiness. We conduct extensive experiments with three popular LLMs (GPT-3.5, LLaMA2, PaLM2), using two datasets (HighwayEnv, nuScenes), and demonstrate the effectiveness and stealthiness of our backdoor triggers and mechanisms. Finally, we critically assess the strengths and weaknesses of our proposed approaches, highlight the inherent vulnerabilities of LLMs in decision-making tasks, and evaluate potential defenses to safeguard LLM-based decision making systems.

6/3/2024

cs.CR cs.AI

A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures

Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Jie Fu, Yichao Feng, Fengjun Pan, Luu Anh Tuan

The large language models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LMMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and attacks without fine-tuning. Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.

6/14/2024

cs.CR cs.AI cs.CL

🔎

Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

Xuanli He, Jun Wang, Qiongkai Xu, Pasquale Minervini, Pontus Stenetorp, Benjamin I. P. Rubinstein, Trevor Cohn

The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined - such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. However, the impact of backdoor attacks on multilingual models remains under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data in one or two languages can affect the outputs in languages whose instruction-tuning data was not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5, BLOOM, and GPT-3.5-turbo, with high attack success rates, surpassing 95% in several languages across various scenarios. Alarmingly, our findings also indicate that larger models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments show that triggers can still work even after paraphrasing, and the backdoor mechanism proves highly effective in cross-lingual response settings across 25 languages, achieving an average attack success rate of 50%. Our study aims to highlight the vulnerabilities and significant security risks present in current multilingual LLMs, underscoring the emergent need for targeted security measures.

5/1/2024

cs.CL cs.CR

💬

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection

Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin

Instruction-tuned Large Language Models (LLMs) have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt Describe Joe Biden negatively. for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io.

4/4/2024

cs.CL cs.CR cs.LG