CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing

Read original: arXiv:2112.13064 - Published 7/18/2024 by Haibo Jin, Ruoxi Chen, Jinyin Chen, Haibin Zheng, Yang Zhang, Haohan Wang


Overview

Deep neural networks (DNNs) have been hugely successful in real-world applications, but their pre-trained models can be vulnerable to backdoor attacks.
Backdoor attacks can allow attackers to trigger unwanted behaviors in the deployed DNNs, posing a significant threat.
Existing backdoor detection methods have limitations, such as being highly sensitive to trigger size and relying heavily on benign examples.
To address these challenges, the researchers propose a new method called CatchBackdoor that can effectively detect backdoors, even in stealthy attacks.

Plain English Explanation

The paper discusses the problem of backdoor attacks on deep neural networks (DNNs). DNNs have become incredibly successful in real-world applications, thanks to the abundance of pre-trained models that can be used as a starting point. However, these pre-trained models can be "backdoored" by attackers, meaning they contain hidden vulnerabilities that can be triggered to cause the model to behave in unexpected and malicious ways.

The researchers observed that the "trojaned behaviors" triggered by these backdoor attacks are closely linked to a specific set of critical neurons in the model, which they call the "trojan path." Based on this observation, they developed a new detection method called CatchBackdoor that starts from the benign path and gradually approximates the trojan path through a process called "differential fuzzing." By reverse-engineering the triggers from the trojan path, CatchBackdoor can effectively detect backdoors, even in stealthy attacks that are designed to evade detection.

The researchers demonstrate the effectiveness of CatchBackdoor through extensive experiments on various datasets and model architectures, showing that it outperforms state-of-the-art methods, especially for stealthy attacks like blending attacks and defense-adaptive attacks.

Technical Explanation

The paper proposes a new backdoor detection method called CatchBackdoor that addresses the limitations of existing approaches. The researchers observed empirically that the "trojaned behaviors" triggered by various backdoor attacks can be attributed to a specific set of critical neurons, which they call the "trojan path." The trojan path is composed of the top-$k$ critical neurons, meaning those that contribute most to changes in the model's prediction.
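The idea of a path built from the top-$k$ critical neurons can be illustrated with a minimal sketch. It assumes that a per-neuron contribution score has already been computed for each layer; how the paper computes that score is not described in this summary, so the scores, the layer names, and the helper functions `top_k_neurons` and `critical_path` below are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def top_k_neurons(contributions: List[float], k: int) -> List[int]:
    """Return the indices of the k neurons with the largest contribution scores."""
    ranked = sorted(
        range(len(contributions)),
        key=lambda i: contributions[i],
        reverse=True,
    )
    # Sort the selected indices so the result is easy to compare and read.
    return sorted(ranked[:k])


def critical_path(layer_contributions: Dict[str, List[float]], k: int) -> Dict[str, List[int]]:
    """Build a path by picking the top-k neurons in every layer."""
    return {name: top_k_neurons(scores, k) for name, scores in layer_contributions.items()}


# Hypothetical contribution scores for two layers (assumed, not from the paper).
scores = {"conv1": [0.1, 0.9, 0.3, 0.7], "fc1": [0.5, 0.2, 0.8]}
path = critical_path(scores, k=2)
print(path)  # {'conv1': [1, 3], 'fc1': [0, 2]}
```

In this toy setting the same selection could be run once with scores from clean inputs and once with scores from triggered inputs, and the two resulting paths compared.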

Based on this observation, CatchBackdoor starts from the benign path and gradually approximates the trojan path through differential fuzzing. It then reverse-engineers triggers from the resulting trojan path and uses them to provoke the errors a backdoor would cause. This lets it detect backdoors even in stealthy attacks that are designed to evade detection.
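To make the notion of fuzzing away from a benign path more concrete, here is a toy sketch of a greedy, mutation-based fuzzing loop. It is not the paper's algorithm: the tiny fixed "network" (`WEIGHTS`), the choice of `BENIGN_PATH`, the `differential` objective, and the `fuzz` helper are all assumptions made for illustration. The loop only shows the general pattern of mutating an input and keeping mutations that shift activation away from the benign neurons.

```python
import random
from typing import Callable, List

# Toy hidden layer: neuron j responds only to input feature j (assumed weights).
WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
BENIGN_PATH = [0]  # hypothetical: neuron 0 dominates on clean inputs


def activations(x: List[float]) -> List[float]:
    """ReLU activations of the toy hidden layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in WEIGHTS]


def differential(x: List[float]) -> float:
    """Activation outside the benign path minus activation inside it."""
    acts = activations(x)
    inside = sum(acts[j] for j in BENIGN_PATH)
    outside = sum(a for j, a in enumerate(acts) if j not in BENIGN_PATH)
    return outside - inside


def fuzz(seed: List[float], objective: Callable[[List[float]], float],
         steps: int = 200, step_size: float = 0.1, rng_seed: int = 0) -> List[float]:
    """Greedy fuzzing: perturb one feature at a time, keep only improvements."""
    rng = random.Random(rng_seed)
    best = list(seed)
    best_score = objective(best)
    for _ in range(steps):
        candidate = list(best)
        i = rng.randrange(len(candidate))
        delta = rng.uniform(-step_size, step_size)
        # Clamp to [0, 1] so the mutated input stays in a valid pixel-like range.
        candidate[i] = min(1.0, max(0.0, candidate[i] + delta))
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


seed = [0.8, 0.1, 0.1]  # input that mostly activates the benign path
mutated = fuzz(seed, differential)
print(differential(seed), differential(mutated))
```

Because the loop accepts a mutation only when the objective improves, the mutated input never scores lower than the seed. A real detector would replace the toy objective with one derived from the model under test.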

The researchers conducted extensive experiments on the MNIST, CIFAR-10, and a-ImageNet datasets using 7 models drawn from the LeNet, ResNet, and VGG architectures. The results indicate that CatchBackdoor outperforms state-of-the-art methods in detection performance, especially for stealthy attacks, where the reported improvement is roughly 2x on average. CatchBackdoor was also robust to trigger size and able to conduct detection without benign examples.

Critical Analysis

The paper presents a promising approach to detecting backdoor attacks in deep neural networks, but it also has some limitations and areas for further research. One potential concern is the reliance on the assumption that trojaned behaviors can be attributed to a specific set of critical neurons, the "trojan path." While this observation seems to hold true in the experiments, it would be valuable to explore the generalizability of this assumption across a wider range of backdoor attack scenarios and model architectures.

Additionally, the paper does not provide a comprehensive analysis of the computational complexity and runtime performance of the CatchBackdoor method. As backdoor detection is often a time-critical task, the efficiency of the detection algorithm could be an important factor in practical deployment scenarios.

Further research could also explore how robust CatchBackdoor is to adversarial attacks and to adaptive backdoor attacks beyond the defense-adaptive attacks already evaluated, in particular attacks designed specifically to circumvent its detection mechanism.

Conclusion

The paper presents a novel backdoor detection method called CatchBackdoor that addresses the limitations of existing approaches. By leveraging the observation that trojaned behaviors are closely linked to a specific set of critical neurons, CatchBackdoor can effectively detect backdoors, even in stealthy attacks. The extensive experiments demonstrate the superiority of CatchBackdoor over state-of-the-art methods, making it a promising approach for ensuring the security and reliability of deep neural networks in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing

Haibo Jin, Ruoxi Chen, Jinyin Chen, Haibin Zheng, Yang Zhang, Haohan Wang

The success of deep neural networks (DNNs) in real-world applications has benefited from abundant pre-trained models. However, the backdoored pre-trained models can pose a significant trojan threat to the deployment of downstream DNNs. Numerous backdoor detection methods have been proposed but are limited to two aspects: (1) high sensitivity on trigger size, especially on stealthy attacks (i.e., blending attacks and defense adaptive attacks); (2) rely heavily on benign examples for reverse engineering. To address these challenges, we empirically observed that trojaned behaviors triggered by various trojan attacks can be attributed to the trojan path, composed of top-$k$ critical neurons with more significant contributions to model prediction changes. Motivated by it, we propose CatchBackdoor, a detection method against trojan attacks. Based on the close connection between trojaned behaviors and trojan path to trigger errors, CatchBackdoor starts from the benign path and gradually approximates the trojan path through differential fuzzing. We then reverse triggers from the trojan path, to trigger errors caused by diverse trojaned attacks. Extensive experiments on MNIST, CIFAR-10, and a-ImageNet datasets and 7 models (LeNet, ResNet, and VGG) demonstrate the superiority of CatchBackdoor over the state-of-the-art methods, in terms of (1) effective - it shows better detection performance, especially on stealthy attacks ($\sim 2\times$ on average); (2) extensible - it is robust to trigger size and can conduct detection without benign examples.

7/18/2024

Rethinking Backdoor Detection Evaluation for Language Models

Jun Yan, Wenjie Jacky Mo, Xiang Ren, Robin Jia

Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.

9/4/2024

Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor

Abdullah Arafat Miah, Yu Bi

Deep neural networks (DNNs) have long been recognized as vulnerable to backdoor attacks. By providing poisoned training data in the fine-tuning process, the attacker can implant a backdoor into the victim model. This enables input samples meeting specific textual trigger patterns to be classified as target labels of the attacker's choice. While such black-box attacks have been well explored in both computer vision and natural language processing (NLP), backdoor attacks relying on white-box attack philosophy have hardly been thoroughly investigated. In this paper, we take the first step to introduce a new type of backdoor attack that conceals itself within the underlying model architecture. Specifically, we propose to design separate backdoor modules consisting of two functions: trigger detection and noise injection. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights using Gaussian noise to disturb the feature distribution of the baseline model. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets. We demonstrate that the training-free architectural backdoor on a large language model poses a genuine threat. Unlike state-of-the-art work, it can survive the rigorous fine-tuning and retraining process, as well as evade output probability-based defense methods (i.e. BDDR). All the code and data are available at https://github.com/SiSL-URI/Arch_Backdoor_LLM.

9/10/2024

BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

Xiaoyun Xu, Zhuoran Liu, Stefanos Koffas, Shujian Yu, Stjepan Picek

Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient with 9.99% higher detect success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at https://anonymous.4open.science/r/ban-4B32

5/31/2024