If You Don't Understand It, Don't Use It: Eliminating Trojans with Filters Between Layers

Read original: arXiv:2407.06411 - Published 7/10/2024 by Adriano Hernandez

Overview

This paper explores techniques to eliminate Trojan attacks in large language models (LLMs) by using filters between layers.
Trojan attacks can cause LLMs to produce malicious outputs when triggered by specific inputs, posing a significant security risk.
The proposed approach aims to remove Trojans without knowing in advance what they are or which inputs trigger them.

Plain English Explanation

Imagine you have a fancy new robot assistant that can do all sorts of helpful tasks for you. But what if someone programmed the robot to also do something bad, like stealing your private information, whenever you say a certain secret phrase? That's kind of like a Trojan attack - the robot looks normal on the outside, but it has a hidden malicious capability.

This paper introduces a way to detect and remove those hidden Trojan capabilities from large language models, a type of AI system that can understand and generate human-like text. By adding special "filter" layers between the normal layers of the language model, the researchers found they could block Trojan behavior without knowing in advance what the secret trigger is.

The key idea is to use these filter layers to check the model's intermediate computations as they pass from layer to layer and keep them consistent with what the model is supposed to be doing, rather than something malicious. This helps protect against Trojans without having to fully understand how the complex language model works under the hood.

Technical Explanation

The paper proposes a general-purpose recipe: insert filter layers between the layers of a large language model. The specific implementation it studies uses LoRA filters, which are meant to block Trojan behavior before it reaches the model's output.
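
The paper's abstract names LoRA filters as the specific implementation. As a rough illustration of what a low-rank filter sitting between layers could look like, here is a minimal sketch in plain Python. The class name LowRankFilter, the helper run_with_filters, the residual form h + B(Ah), and the zero initialization of B are all assumptions made for illustration; how the paper actually parameterizes and trains its filters is not stated here.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankFilter:
    """Hypothetical filter: h -> h + B @ (A @ h), with rank much smaller than dim."""

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # A projects the activation down to `rank` dimensions (small random init).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # B projects back up; zero init means the filter starts as the identity.
        self.B = [[0.0] * rank for _ in range(dim)]

    def __call__(self, hidden):
        update = matvec(self.B, matvec(self.A, hidden))
        return [h + u for h, u in zip(hidden, update)]


def run_with_filters(layers, filters, hidden):
    """Apply each layer, then the filter placed after it (None means no filter)."""
    for layer, flt in zip(layers, filters):
        hidden = layer(hidden)
        if flt is not None:
            hidden = flt(hidden)
    return hidden


# Toy "model": two layers acting on a 3-dimensional activation.
toy_layers = [
    lambda h: [2.0 * x for x in h],
    lambda h: [x + 1.0 for x in h],
]
toy_filters = [None, LowRankFilter(dim=3, rank=1)]
filtered = run_with_filters(toy_layers, toy_filters, [1.0, 2.0, 3.0])
```

Because B starts at zero, an untrained filter passes activations through unchanged, so inserting it does not alter the model until the filter's weights are trained. Only the small A and B matrices would need training, which is the usual appeal of a low-rank parameterization.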

The filter layers work by maintaining a set of "prototypes" - representative examples of normal model behavior. As the input passes through the model, the filters continuously compare the intermediate activations to these prototypes. If the activations deviate too much from the normal patterns, the filters can block the input before it reaches the output.
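
The comparison described above can be sketched as a nearest-prototype distance check. This is only an illustration of the description in this summary, not the paper's implementation: the function names, the use of Euclidean distance, and the fixed threshold are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length activation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_prototype_distance(activation, prototypes):
    """Distance from an activation to the closest stored prototype."""
    return min(euclidean(activation, p) for p in prototypes)


def passes_filter(activation, prototypes, threshold):
    """True if the activation is close enough to some prototype of normal behavior."""
    return nearest_prototype_distance(activation, prototypes) <= threshold


# Hypothetical prototypes collected from clean inputs (toy 2-dimensional values).
clean_prototypes = [[0.0, 0.0], [1.0, 1.0]]

normal_ok = passes_filter([0.1, 0.0], clean_prototypes, threshold=0.5)
outlier_ok = passes_filter([5.0, 5.0], clean_prototypes, threshold=0.5)
```

Here the first activation lies close to a prototype and passes, while the second lies far from both and would be blocked. In practice the threshold trades off missed Trojans against falsely blocked clean inputs.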

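The paper demonstrates the effectiveness of this approach on several language model architectures, including GPT-2 and BERT. The filters were able to detect and eliminate Trojan attacks without significantly impacting the model's normal performance. According to the abstract, the filters work in practice on small to medium sized models and perform best on the residual stream and the latest layers.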

Critical Analysis

The paper provides a promising approach for protecting large language models from Trojan attacks without needing to know what the Trojans are or how they are triggered. This is an important capability, as prior research has shown that Trojans can be difficult to detect and remove, even for the model's owners.

However, the paper does acknowledge some limitations. The filter layers add computational overhead, which could impact the model's inference speed. Additionally, the approach may not be effective against more sophisticated Trojan attacks that are specifically designed to evade the filter prototypes.

Further research is needed to address these challenges and explore the potential for prompt injection attacks to bypass the filter defenses. Nonetheless, this work represents an important step forward in securing large language models against Trojan threats.

Conclusion

This paper presents a novel approach to detecting and eliminating Trojan attacks in large language models by inserting filter layers between the model's hidden layers. The filters continuously monitor the model's internal activations to identify and block any deviations from normal behavior, effectively neutralizing the Trojan threat.

While the method has some limitations, it offers a promising way to improve the security of LLMs without requiring prior knowledge of the Trojans themselves. As large language models become increasingly ubiquitous in a wide range of applications, developing robust defenses against Trojan attacks will be crucial for ensuring the trustworthiness and reliability of these powerful AI systems.

Related Papers

If You Don't Understand It, Don't Use It: Eliminating Trojans with Filters Between Layers

Adriano Hernandez

Large language models (LLMs) sometimes exhibit dangerous unintended behaviors. Finding and fixing these is challenging because the attack surface is massive -- it is not tractable to exhaustively search for all possible inputs that may elicit such behavior. One specific and particularly challenging case is that of data-poisoning-injected trojans, since there is no way to know what they are to search for them. To our knowledge, there is no generally applicable method to unlearn unknown trojans injected during pre-training. This work seeks to provide a general purpose recipe (filters) and a specific implementation (LoRA filters) that work in practice on small to medium sized models. The focus is primarily empirical, though some perplexing behavior opens the door to the fundamental question of how LLMs store and process information. Not unexpectedly, we find that our filters work best on the residual stream and the latest layers.

7/10/2024

On Trojans in Refined Language Models

Jayaram Raghuram, George Kesidis, David J. Miller

Backdoor data poisoning, inserted within instruction examples used to fine-tune a foundation Large Language Model (LLM) for downstream tasks (e.g., sentiment prediction), is a serious security concern due to the evasive nature of such attacks. The poisoning is usually in the form of a (seemingly innocuous) trigger word or phrase inserted into a very small fraction of the fine-tuning samples from a target class. Such backdoor attacks can: alter response sentiment, violate censorship, over-refuse (invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations). In this work we investigate the efficacy of instruction fine-tuning backdoor attacks as attack hyperparameters are varied under a variety of scenarios, considering: the trigger location in the poisoned examples; robustness to change in the trigger location, partial triggers, and synonym substitutions at test time; attack transfer from one (fine-tuning) domain to a related test domain; and clean-label vs. dirty-label poisoning. Based on our observations, we propose and evaluate two defenses against these attacks: i) a during-fine-tuning defense based on word-frequency counts that assumes the (possibly poisoned) fine-tuning dataset is available and identifies the backdoor trigger tokens; and ii) a post-fine-tuning defense based on downstream clean fine-tuning of the backdoored LLM with a small defense dataset. Finally, we provide a brief survey of related work on backdoor attacks and defenses.

8/23/2024

Unlearning Trojans in Large Language Models: A Comparison Between Natural Language and Source Code

Mahdi Kazemi, Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour, Sen Lin

This work investigates the application of Machine Unlearning (MU) for mitigating the impact of trojans embedded in conventional large language models of natural language (Text-LLMs) and large language models of code (Code-LLMs). We propose a novel unlearning approach, LYA, that leverages both gradient ascent and elastic weight consolidation, a Fisher Information Matrix (FIM) based regularization technique, to unlearn trojans from poisoned models. We compare the effectiveness of LYA against conventional techniques like fine-tuning, retraining, and vanilla gradient ascent. The subject models we investigate are BERT and CodeBERT, for sentiment analysis and code defect detection tasks, respectively. Our findings demonstrate that the combination of gradient ascent and FIM-based regularization, as done in LYA, outperforms existing methods in removing the trojan's influence from the poisoned model, while preserving its original functionality. To the best of our knowledge, this is the first work that compares and contrasts MU of trojans in LLMs, in the NL and Coding domains.

8/23/2024

Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy

Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Bowen Xu, Premkumar Devanbu, Mohammad Amin Alipour

Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents an overview of the current state-of-the-art trojan attacks on large language models of code, with a focus on triggers -- the main design point of trojans -- with the aid of a novel unifying trigger taxonomy framework. We also aim to provide a uniform definition of the fundamental concepts in the area of trojans in Code LLMs. Finally, we draw implications of findings on how code models learn on trigger design.

5/7/2024