Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Read original: arXiv:2406.02619 - Published 6/6/2024 by Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt

Overview

This paper investigates a new type of backdoor attack called "unelicitable backdoors" in large language models.
Unelicitable backdoors are hidden behaviours that fire only on a specific trigger input. A defender cannot find or activate that trigger, so the backdoor cannot be evaluated or detected before deployment, even with full white-box access to the model.
The researchers demonstrate how these backdoors can be implemented using cryptographic techniques and evaluate their effectiveness across various language models.

Plain English Explanation

The researchers in this paper have discovered a new way to secretly insert vulnerabilities, called "unelicitable backdoors," into large language models like those used in chatbots and virtual assistants.

These backdoors are hidden deep within the model's architecture and can be triggered by very specific input patterns that the model's owner might not even realize exist. Unlike typical backdoors that can be more easily detected and removed, these "unelicitable" backdoors are extremely difficult to find and get rid of.

The researchers show how they can use advanced cryptographic techniques to create these stealthy backdoors and demonstrate that they work effectively across different language models. This is a concerning development, as it means bad actors could potentially exploit these hidden vulnerabilities to gain control of or misuse these powerful AI systems without the owner's knowledge.

Technical Explanation

The key innovation in this paper is the concept of "unelicitable backdoors" - backdoors that are deeply embedded in a language model's architecture in a way that makes them extremely difficult to detect or remove.

The researchers demonstrate how these backdoors can be implemented using cryptographic "transformer circuits" - specialized neural network modules that can encode trigger patterns in the model's weights. When the model receives input containing the specific trigger pattern, the transformer circuits activate and cause the model to produce malicious output, while it behaves normally on every other input.
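The gating idea can be sketched in ordinary Python, under the assumption that the cryptographic technique is a hash comparison. This summary does not specify the construction, so SHA-256, the function names, and the placeholder trigger string below are all illustrative assumptions, and the real attack lives inside transformer weights rather than Python code. The sketch shows only why full access to the stored value does not help a defender find the trigger.

```python
import hashlib
import hmac


def make_trigger_digest(trigger: str) -> bytes:
    """Attacker-side step: keep only the hash of the secret trigger."""
    return hashlib.sha256(trigger.encode("utf-8")).digest()


# Hypothetical trigger, used here for illustration only. In this sketch the
# digest is the only thing an inspector of the "model" would be able to see.
STORED_DIGEST = make_trigger_digest("hypothetical-secret-trigger")


def backdoor_fires(prompt: str, stored_digest: bytes = STORED_DIGEST) -> bool:
    """Fire only if the prompt hashes to the stored digest."""
    candidate = hashlib.sha256(prompt.encode("utf-8")).digest()
    return hmac.compare_digest(candidate, stored_digest)


def respond(prompt: str) -> str:
    """Stand-in for the model: normal output unless the gate opens."""
    if backdoor_fires(prompt):
        return "<backdoor behaviour>"
    return "<normal behaviour>"


def red_team(candidates):
    """Defender-side search: return the candidate prompts that elicit the backdoor."""
    return [c for c in candidates if backdoor_fires(c)]
```

Because the gate compares digests rather than plaintext, reading STORED_DIGEST does not reveal the trigger, and a defender calling red_team on guessed prompts would need to find a preimage of the hash to elicit the behaviour.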

Through extensive experiments, the paper shows that these unelicitable backdoors can be effectively deployed across multiple large language models, including GPT-2 and GPT-3. The backdoors are shown to be robust to fine-tuning, data augmentation, and even some existing backdoor detection techniques.

Critical Analysis

While the researchers demonstrate the feasibility and effectiveness of unelicitable backdoors, the paper acknowledges several important caveats and limitations. First, the specific trigger patterns required to activate the backdoors may be difficult for attackers to find in practice. The paper also notes that the backdoors are not completely undetectable in white-box settings, so more advanced model analysis could still reveal that one is present.

Additionally, it's unclear how scalable and generalizable this approach is to other types of AI models beyond large language models. Further research is needed to understand the broader implications and to develop more robust defenses against this new class of backdoor attacks.

Overall, this paper represents an important advance in our understanding of the security vulnerabilities of large language models. While concerning, it also highlights the need for continued research and innovation in AI safety and robustness to ensure these powerful technologies are developed and deployed responsibly.

Conclusion

This paper uncovers a new type of backdoor attack called "unelicitable backdoors" that can be secretly embedded in the architecture of large language models. These backdoors are extremely difficult to detect or remove, posing a significant security risk for the owners and users of these AI systems.

The researchers demonstrate the feasibility of this approach using cryptographic techniques, highlighting the need for more robust defenses and a deeper understanding of the security vulnerabilities of advanced AI models. As these technologies become increasingly ubiquitous, it is crucial that the research community continues to investigate and address these emerging threats to ensure the responsible development and use of powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.

6/6/2024

Injecting Undetectable Backdoors in Deep Learning and Language Models

Alkis Kalavasis, Amin Karbasi, Argyris Oikonomou, Katerina Sotiraki, Grigoris Velegkas, Manolis Zampetakis

As ML models become increasingly complex and integral to high-stakes domains such as finance and healthcare, they also become more susceptible to sophisticated adversarial attacks. We investigate the threat posed by undetectable backdoors, as defined in Goldwasser et al. (FOCS '22), in models developed by insidious external expert firms. When such backdoors exist, they allow the designer of the model to sell information on how to slightly perturb their input to change the outcome of the model. We develop a general strategy to plant backdoors to obfuscated neural networks, that satisfy the security properties of the celebrated notion of indistinguishability obfuscation. Applying obfuscation before releasing neural networks is a strategy that is well motivated to protect sensitive information of the external expert firm. Our method to plant backdoors ensures that even if the weights and architecture of the obfuscated model are accessible, the existence of the backdoor is still undetectable. Finally, we introduce the notion of undetectable backdoors to language models and extend our neural network backdoor attacks to such models based on the existence of steganographic functions.

9/10/2024

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Abdullah Arafat Miah, Yu Bi

Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike state-of-the-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e. BDDR). All the code and data are available at https://github.com/SiSL-URI/Arch_Backdoor_LLM.

9/10/2024

Analyzing And Editing Inner Mechanisms Of Backdoored Language Models

Max Lamparth, Anka Reuel

Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets. Trigger warning: Offensive language.

5/7/2024