TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

2405.13401

Published 6/3/2024 by Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, Gongshen Liu

cs.CR cs.CL

🛸

Abstract

Large language models (LLMs) have raised concerns about potential security threats despite performing significantly in Natural Language Processing (NLP). Backdoor attacks initially verified that LLM is doing substantial harm at all stages, but the cost and robustness have been criticized. Attacking LLMs is inherently risky in security review, while prohibitively expensive. Besides, the continuous iteration of LLMs will degrade the robustness of backdoors. In this paper, we propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation, thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, thus constraining the triggering conditions to a parameter subspace to improve the matching. To improve the recall of the RAG for the target contexts, we introduce a knowledge graph to construct structured data to achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both attackers' and users' perspectives and further verify whether the context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG exhibits versatility threats while maintaining retrieval capabilities on normal queries.

Create account to get full access

Overview

Large language models (LLMs) have shown impressive performance in natural language processing tasks, but they also raise concerns about potential security threats.
Backdoor attacks have been demonstrated on LLMs, where the model can be manipulated to produce harmful outputs under specific conditions.
However, the cost and robustness of these attacks have been criticized, and the continuous iteration of LLMs may degrade the effectiveness of backdoors.

Plain English Explanation

Large language models (LLMs) are a type of artificial intelligence that can understand and generate human-like text. These models have been incredibly successful at tasks like translation, answering questions, and even creative writing. However, some researchers have found that LLMs can be vulnerable to a type of attack called a "backdoor."

A backdoor attack is when an attacker secretly trains the language model to produce harmful or malicious outputs under certain conditions, like typing a specific phrase or word. This is concerning because it means that even a well-performing LLM could be manipulated to cause harm.

The good news is that the cost and robustness of these backdoor attacks have been criticized. Additionally, as language models are continuously improved and updated, the effectiveness of these backdoors may degrade over time.

Technical Explanation

In this paper, the researchers propose a new technique called "TrojanRAG" that uses a joint backdoor attack in the Retrieval-Augmented Generation (RAG) framework. RAG is a type of language model that combines the generation capabilities of a traditional LLM with the structured knowledge of a retrieval system.

The key idea behind TrojanRAG is to construct elaborate "target contexts" and "trigger sets" that can manipulate the RAG model to produce harmful outputs. The researchers use contrastive learning to optimize multiple pairs of backdoor shortcuts, which helps constrain the triggering conditions to a specific parameter subspace. This improves the "matching" between the trigger and the target context.

Additionally, the researchers introduce a knowledge graph to help the RAG model better retrieve relevant information for the target contexts, improving the overall recall and making the backdoor attack more effective.

The researchers evaluate the TrojanRAG approach on measures of truthfulness, language understanding, and overall harmfulness. They find that TrojanRAG can exhibit "versatile threats" while still maintaining normal retrieval capabilities for non-malicious queries.

Critical Analysis

The researchers acknowledge that while their TrojanRAG attack is effective, the cost and robustness of such attacks have been criticized in the past. They also note that the continuous improvement of LLMs may degrade the effectiveness of these backdoors over time.

Additionally, the researchers focused their evaluation on the "versatility" of the TrojanRAG attack, but did not extensively explore the potential real-world harms or implications of such an attack. There may be other important factors to consider, such as the ethical implications or the potential for misuse.

Further research is needed to better understand the long-term resilience of LLMs against evolving backdoor attacks, as well as the broader societal impact of these security vulnerabilities.

Conclusion

This paper presents a new technique called TrojanRAG that can manipulate Retrieval-Augmented Generation (RAG) models to produce harmful outputs under specific conditions. While the researchers demonstrate the technical effectiveness of their approach, the long-term implications and robustness of such attacks remain important areas for further exploration.

As large language models continue to advance and become more ubiquitous, ensuring their security and responsible development will be crucial for realizing the full potential of these powerful AI systems while mitigating potential risks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models

Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, Qian Lou

Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data, commonly referred to as hallucinations. Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models. This approach involves retrieving relevant information from a large, up-to-date dataset and using it to enhance the generation process, leading to more accurate and contextually appropriate responses. Despite its benefits, RAG introduces a new attack surface for LLMs, particularly because RAG databases are often sourced from public data, such as the web. In this paper, we propose TrojRAG{} to identify the vulnerabilities and attacks on retrieval parts (RAG database) and their indirect attacks on generative parts (LLMs). Specifically, we identify that poisoning several customized content passages could achieve a retrieval backdoor, where the retrieval works well for clean queries but always returns customized poisoned adversarial queries. Triggers and poisoned passages can be highly customized to implement various attacks. For example, a trigger could be a semantic group like The Republican Party, Donald Trump, etc. Adversarial passages can be tailored to different contents, not only linked to the triggers but also used to indirectly attack generative LLMs without modifying them. These attacks can include denial-of-service attacks on RAG and semantic steering attacks on LLM generations conditioned by the triggers. Our experiments demonstrate that by just poisoning 10 adversarial passages can induce 98.2% success rate to retrieve the adversarial passages. Then, these passages can increase the reject ratio of RAG-based GPT-4 from 0.01% to 74.6% or increase the rate of negative responses from 0.22% to 72% for targeted queries.

6/7/2024

cs.CR cs.AI cs.CL cs.IR cs.LG

Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea

Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) in chatbot applications, enabling developers to adapt and personalize the LLM output without expensive training or fine-tuning. RAG systems use an external knowledge database to retrieve the most relevant documents for a given query, providing this context to the LLM generator. While RAG achieves impressive utility in many applications, its adoption to enable personalized generative models introduces new security risks. In this work, we propose new attack surfaces for an adversary to compromise a victim's RAG system, by injecting a single malicious document in its knowledge database. We design Phantom, general two-step attack framework against RAG augmented LLMs. The first step involves crafting a poisoned document designed to be retrieved by the RAG system within the top-k results only when an adversarial trigger, a specific sequence of words acting as backdoor, is present in the victim's queries. In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks in the LLM generator, including denial of service, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama.

6/3/2024

cs.CR cs.CL cs.LG

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Zhenting Qi, Hanlin Zhang, Eric Xing, Sham Kakade, Himabindu Lakkaraju

Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection. The vulnerability exists for a wide range of modern LMs that span Llama2, Mistral/Mixtral, Vicuna, SOLAR, WizardLM, Qwen1.5, and Platypus2, and the exploitability exacerbates as the model size scales up. Extending our study to production RAG models GPTs, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the GPTs with only 100 queries generated by themselves.

6/24/2024

cs.CL cs.AI cs.CR cs.LG

🛸

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models

Mintong Kang, Nezihe Merve Gurel, Ning Yu, Dawn Song, Bo Li

Despite the impressive capabilities of large language models (LLMs) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. Retrieval-augmented language models (RAG) have been proposed to enhance the credibility of generations by grounding external knowledge, but the theoretical understandings of their generation risks remains unexplored. In this paper, we answer: 1) whether RAG can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of RAG and vanilla LLMs, and 3) what sufficient conditions enable RAG models to reduce generation risks. We propose C-RAG, the first framework to certify generation risks for RAG models. Specifically, we provide conformal risk analysis for RAG models and certify an upper confidence bound of generation risks, which we refer to as conformal generation risk. We also provide theoretical guarantees on conformal generation risks for general bounded risk functions under test distribution shifts. We prove that RAG achieves a lower conformal generation risk than that of a single LLM when the quality of the retrieval model and transformer is non-trivial. Our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used NLP datasets on four state-of-the-art retrieval models.

6/5/2024

cs.AI cs.CL cs.IR