Corpus Poisoning via Approximate Greedy Gradient Descent

Read original: arXiv:2406.05087 - Published 6/10/2024 by Jinyan Su, John X. Morris, Preslav Nakov, Claire Cardie

Overview

This paper proposes Approximate Greedy Gradient Descent (AGGD), an attack that generates adversarial passages for poisoning the corpus of a dense retrieval system.
The attacker injects a small fraction of adversarial passages into the retrieval corpus so that the system returns them among the top-ranked results for a broad set of user queries.
The authors evaluate AGGD on several datasets and retrievers, reporting a high attack success rate and generalization to unseen queries and new domains.

Plain English Explanation

The paper describes a technique for tampering with the collection of passages that a retrieval system searches, rather than with a model's training data. A dense retriever turns queries and passages into vectors and returns the passages whose vectors are most similar to the query. The attack crafts "adversarial passages" - carefully constructed text that, once added to the corpus, is retrieved for many different queries.

For example, imagine a question-answering system built on Retrieval-Augmented Generation (RAG), where a language model answers using passages fetched by a dense retriever. An attacker who manages to add a few passages to the corpus could make those passages appear among the top results for many unrelated questions, so that attacker-chosen text reaches the language model. Neither the retriever nor the language model has to be modified, and the system's developers may not realize that the corpus has been tampered with.
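
To make the retrieval side of this concrete, here is a minimal sketch of dot-product retrieval over a toy corpus. Everything in it is an assumption for illustration: the random vectors stand in for real retriever embeddings, and the adversarial embedding is set directly instead of being produced from tokens, which is the hard part the paper's method addresses.

```python
import numpy as np

def top_k(query_vecs, passage_vecs, k):
    """Return the indices of the k highest-scoring passages per query (dot product)."""
    scores = query_vecs @ passage_vecs.T
    return np.argsort(-scores, axis=1)[:, :k]

rng = np.random.default_rng(0)
dim, n_queries, n_passages, k = 16, 50, 200, 5

# Toy embeddings: queries share a common offset, mimicking queries from one domain.
queries = rng.normal(size=(n_queries, dim)) + 1.0
corpus = rng.normal(size=(n_passages, dim))

# Hypothetical adversarial passage whose embedding is aligned with the mean query.
# A real attacker can only choose tokens, not the embedding itself.
adversarial = 3 * queries.mean(axis=0, keepdims=True)
poisoned_corpus = np.vstack([corpus, adversarial])
adv_index = n_passages  # position of the injected passage in the poisoned corpus

# For each query, check whether the injected passage made it into the top k.
hits = (top_k(queries, poisoned_corpus, k) == adv_index).any(axis=1)
print(f"adversarial passage in top-{k} for {hits.mean():.0%} of queries")
```

A single injected vector that points in the direction shared by most queries can outrank ordinary passages for many of them, which is why a small fraction of poisoned passages can matter.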

The key innovation of this work is the "approximate greedy gradient descent" algorithm. It builds on HotFlip, a widely used method for generating adversarial passages through token-level changes, and replaces HotFlip's random token sampling with a more structured search. According to the authors, this selects a higher-quality set of token-level perturbations, which makes the resulting passages more likely to be retrieved.

Technical Explanation

The paper introduces Approximate Greedy Gradient Descent (AGGD), an attack on dense retrieval systems that generates adversarial passages for corpus poisoning. The core idea is to find passages that, once injected into the retrieval corpus, rank among the top results for a broad set of user queries.
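
One common way to write down a corpus poisoning objective of the kind the abstract describes is shown below. The notation is an assumption made for illustration and is not taken from the paper: $Q$ is a set of queries available to the attacker, $E_q$ and $E_p$ are the retriever's query and passage encoders, $\mathcal{V}$ is the vocabulary, and $L$ is the passage length.

```latex
a^{*} \;=\; \arg\max_{a \in \mathcal{V}^{L}} \; \frac{1}{|Q|} \sum_{q \in Q} E_q(q)^{\top} E_p(a)
```

Because $a$ is a sequence of discrete tokens, this cannot be solved with ordinary gradient descent, which is why HotFlip-style methods search over token substitutions instead.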

The attack can be viewed as a discrete optimization problem: choose the tokens of a passage so that the retriever scores it as highly similar to as many queries as possible. AGGD is based on the widely used HotFlip method for generating adversarial passages, and replaces HotFlip's random token sampling with a more structured search. The authors show that this selects a higher-quality set of token-level perturbations than HotFlip.
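
The sketch below illustrates gradient-guided token substitution on a toy encoder and contrasts a HotFlip-style step (random position) with one possible structured alternative (best candidates across all positions). It is not the authors' implementation: the toy encoder, the `structured` variant, and all names and sizes are assumptions, and the abstract does not spell out AGGD's exact search procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, length, n_cand = 500, 32, 8, 20
E = rng.normal(size=(vocab, dim))   # toy token embedding table (assumption)
q_bar = rng.normal(size=dim)        # mean query embedding, assumed given

def score(tokens):
    """Similarity between the mean query and the toy passage encoding."""
    return float(q_bar @ np.tanh(E[tokens].mean(axis=0)))

def grad_wrt_embeddings(tokens):
    """Gradient of score w.r.t. one position's input embedding.
    It is identical for every position in this toy mean-pooling encoder."""
    m = E[tokens].mean(axis=0)
    return q_bar * (1.0 - np.tanh(m) ** 2) / len(tokens)

def estimated_gains(tokens):
    """First-order gain of each (position, token) swap: (E[v] - E[t_i]) . g."""
    g = grad_wrt_embeddings(tokens)
    return (E @ g)[None, :] - (E[tokens] @ g)[:, None]   # shape (length, vocab)

def step(tokens, structured):
    gains = estimated_gains(tokens)
    if structured:
        # Assumed structured variant: the n_cand best swaps over all positions.
        flat = np.argsort(-gains, axis=None)[:n_cand]
        pos, tok = np.unravel_index(flat, gains.shape)
    else:
        # HotFlip-style: one random position, its n_cand best tokens.
        p = rng.integers(length)
        tok = np.argsort(-gains[p])[:n_cand]
        pos = np.full(n_cand, p)
    # Evaluate candidates exactly and keep the best one only if it improves.
    best, best_score = tokens, score(tokens)
    for p, v in zip(pos, tok):
        cand = tokens.copy()
        cand[p] = v
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best

def attack(structured, steps=30):
    tokens = np.zeros(length, dtype=int)   # start from a fixed dummy passage
    for _ in range(steps):
        tokens = step(tokens, structured)
    return score(tokens)

print("random position:", attack(structured=False))
print("structured:     ", attack(structured=True))
```

In both variants the gradient is only used to shortlist candidates. Each shortlisted swap is then scored exactly, so the objective never decreases from one step to the next.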

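The authors report that AGGD achieves a high attack success rate on several datasets and with several retrievers, and that it generalizes to unseen queries and new domains. The attack is especially effective against the ANCE retrieval model, where the reported attack success rates are 17.6% and 13.37% higher than HotFlip's on the NQ and MS MARCO datasets, respectively. They also demonstrate AGGD's potential to replace HotFlip in other adversarial attacks, such as knowledge poisoning of RAG systems.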
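
As a rough illustration of how an attack success rate can be computed, the sketch below counts the queries whose top-k results contain at least one adversarial passage. This definition, the value of k, and the passage ids are assumptions for the example, and the paper's exact evaluation protocol may differ.

```python
def attack_success_rate(rankings, adversarial_ids, k):
    """Fraction of queries whose top-k results contain at least one adversarial passage."""
    adversarial_ids = set(adversarial_ids)
    hits = sum(
        any(pid in adversarial_ids for pid in ranked[:k])
        for ranked in rankings
    )
    return hits / len(rankings)

# Hypothetical ranked passage ids for four queries, best match first.
rankings = [
    ["p3", "adv1", "p7"],
    ["p1", "p2", "p9"],
    ["adv2", "p4", "p1"],
    ["p5", "p6", "adv1"],
]

print(attack_success_rate(rankings, {"adv1", "adv2"}, k=2))  # 0.5
print(attack_success_rate(rankings, {"adv1", "adv2"}, k=3))  # 0.75
```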

Potential defenses against this kind of attack include filtering adversarial passages out of the corpus or detecting them at retrieval time. However, these defenses may be challenging to implement in practice, as adversarial passages make up only a small fraction of the corpus and can be difficult to detect.

Critical Analysis

The research presented in this paper is concerning, as it demonstrates an effective technique for poisoning the corpus of a dense retrieval system so that adversarial passages surface for a broad set of queries. The authors show that passages generated with AGGD remain effective on unseen queries and new domains, which suggests the attacker does not need to know in advance what users will ask.

One limitation is that the evaluation focuses on dense retrievers, and it's unclear how well the technique would translate to other kinds of retrieval systems. Additionally, the broader societal implications of such attacks, such as malicious actors using poisoned corpora to spread disinformation or amplify harmful biases through RAG systems, remain to be explored.

Defenses such as detecting adversarial passages or filtering them out of the corpus may not be sufficient to fully mitigate the risks posed by attacks like AGGD. More research is needed to develop robust and scalable defenses against this type of attack.

Overall, this paper highlights the importance of securing the corpora that retrieval systems draw on, as even a small fraction of injected passages can have significant consequences. Researchers and developers working in this space should be mindful of the potential for adversarial attacks and take proactive steps to protect their systems.

Conclusion

The "Corpus Poisoning via Approximate Greedy Gradient Descent" paper presents AGGD, a concerning technique for poisoning the corpus of a dense retrieval system. By replacing HotFlip's random token sampling with a more structured search, the authors generate adversarial passages that are returned among the top results for many queries, including unseen queries and new domains.

While the paper focuses on dense retrievers, the implications extend to systems built on top of them, such as RAG pipelines, where the integrity of the retrieval corpus is crucial. As these systems become more widely deployed, it will be increasingly important for researchers and developers to address the risks posed by adversarial attacks like AGGD, in order to ensure their reliability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Corpus Poisoning via Approximate Greedy Gradient Descent

Jinyan Su, John X. Morris, Preslav Nakov, Claire Cardie

Dense retrievers are widely used in information retrieval and have also been successfully extended to other knowledge intensive areas such as language models, e.g., Retrieval-Augmented Generation (RAG) systems. Unfortunately, they have recently been shown to be vulnerable to corpus poisoning attacks in which a malicious user injects a small fraction of adversarial passages into the retrieval corpus to trick the system into returning these passages among the top-ranked results for a broad set of user queries. Further study is needed to understand the extent to which these attacks could limit the deployment of dense retrievers in real-world applications. In this work, we propose Approximate Greedy Gradient Descent (AGGD), a new attack on dense retrieval systems based on the widely used HotFlip method for efficiently generating adversarial passages. We demonstrate that AGGD can select a higher quality set of token-level perturbations than HotFlip by replacing its random token sampling with a more structured search. Experimentally, we show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains. Notably, our method is extremely effective in attacking the ANCE retrieval model, achieving attack success rates that are 17.6% and 13.37% higher on the NQ and MS MARCO datasets, respectively, compared to HotFlip. Additionally, we demonstrate AGGD's potential to replace HotFlip in other adversarial attacks, such as knowledge poisoning of RAG systems. Code can be found at https://github.com/JinyanSu1/AGGD.

6/10/2024

On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains

Xun Xian, Ganghua Wang, Xuan Bi, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding

Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs' generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q&A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query's embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q&A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.

9/27/2024

BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models

Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, Qian Lou

Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data, commonly referred to as hallucinations. Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models. This approach involves retrieving relevant information from a large, up-to-date dataset and using it to enhance the generation process, leading to more accurate and contextually appropriate responses. Despite its benefits, RAG introduces a new attack surface for LLMs, particularly because RAG databases are often sourced from public data, such as the web. In this paper, we propose TrojRAG to identify the vulnerabilities and attacks on retrieval parts (RAG database) and their indirect attacks on generative parts (LLMs). Specifically, we identify that poisoning several customized content passages could achieve a retrieval backdoor, where the retrieval works well for clean queries but always returns customized poisoned adversarial queries. Triggers and poisoned passages can be highly customized to implement various attacks. For example, a trigger could be a semantic group like The Republican Party, Donald Trump, etc. Adversarial passages can be tailored to different contents, not only linked to the triggers but also used to indirectly attack generative LLMs without modifying them. These attacks can include denial-of-service attacks on RAG and semantic steering attacks on LLM generations conditioned by the triggers. Our experiments demonstrate that by just poisoning 10 adversarial passages can induce 98.2% success rate to retrieve the adversarial passages. Then, these passages can increase the reject ratio of RAG-based GPT-4 from 0.01% to 74.6% or increase the rate of negative responses from 0.22% to 72% for targeted queries.

6/7/2024

PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models

Wei Zou, Runpeng Geng, Binghui Wang, Jinyuan Jia

Large language models (LLMs) have achieved remarkable success due to their exceptional generative capabilities. Despite their success, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. Retrieval-Augmented Generation (RAG) is a state-of-the-art technique to mitigate these limitations. The key idea of RAG is to ground the answer generation of an LLM on external knowledge retrieved from a knowledge database. Existing studies mainly focus on improving the accuracy or efficiency of RAG, leaving its security largely unexplored. We aim to bridge the gap in this work. We find that the knowledge database in a RAG system introduces a new and practical attack surface. Based on this attack surface, we propose PoisonedRAG, the first knowledge corruption attack to RAG, where an attacker could inject a few malicious texts into the knowledge database of a RAG system to induce an LLM to generate an attacker-chosen target answer for an attacker-chosen target question. We formulate knowledge corruption attacks as an optimization problem, whose solution is a set of malicious texts. Depending on the background knowledge (e.g., black-box and white-box settings) of an attacker on a RAG system, we propose two solutions to solve the optimization problem, respectively. Our results show PoisonedRAG could achieve a 90% attack success rate when injecting five malicious texts for each target question into a knowledge database with millions of texts. We also evaluate several defenses and our results show they are insufficient to defend against PoisonedRAG, highlighting the need for new defenses.

8/14/2024