Partial train and isolate, mitigate backdoor attack

2405.16488

Published 6/7/2024 by Yong Li, Han Gao


Abstract

Neural networks are widely known to be vulnerable to backdoor attacks, in which an attacker poisons a portion of the training data so that the target model performs well on normal data while outputting attacker-specified or random categories on the poisoned samples. Backdoor attacks pose serious threats. Poisoned samples are becoming more and more similar to the corresponding normal samples, so that even the human eye cannot easily distinguish them. At the same time, the accuracy of backdoored models on normal samples is no different from that of clean models. In this article, by observing the characteristics of backdoor attacks, we provide a new model training method (PT) that freezes part of the model to train a model that can isolate suspicious samples. Then, on this basis, a clean model is fine-tuned to resist backdoor attacks.


Overview

Neural networks are vulnerable to backdoor attacks, which involve poisoning a portion of the training data to make the model perform well on normal data but output attacker-specified or random categories on the poisoned samples.
Poisoned samples are becoming increasingly similar to corresponding normal samples, making them hard for the human eye to distinguish.
The accuracy of models carrying backdoors on normal samples is no different from that of clean models.
The paper proposes a new model training method (PT) that freezes part of the model to isolate suspicious samples, then fine-tunes a clean model to resist backdoor attacks.

Plain English Explanation

Neural networks, a type of machine learning model, have a known weakness: they can be tricked by backdoor attacks. In a backdoor attack, the attacker secretly injects poisoned samples into the model's training data. These samples look very similar to normal, legitimate data, but the model learns to output an attacker-chosen or random category for them instead of the correct one.
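To make the poisoning step concrete, here is a minimal Python sketch of a classic patch-trigger attack in the style of BadNets. The trigger shape, the hypothetical `stamp_trigger` and `poison_dataset` helpers, the poisoning rate and the target label are all illustrative assumptions; the paper summarized here does not prescribe a particular attack.

```python
import random
from typing import List, Tuple

# Hypothetical representation: a grayscale image as rows of floats in [0, 1].
Image = List[List[float]]


def stamp_trigger(img: Image, size: int = 3, value: float = 1.0) -> Image:
    """Return a copy of `img` with a bright square patch in the bottom-right corner."""
    out = [row[:] for row in img]  # copy so the clean image is left untouched
    h, w = len(out), len(out[0])
    for r in range(h - size, h):
        for c in range(w - size, w):
            out[r][c] = value
    return out


def poison_dataset(
    data: List[Tuple[Image, int]], rate: float, target: int, seed: int = 0
) -> Tuple[List[Tuple[Image, int]], List[int]]:
    """Stamp the trigger on a random `rate` fraction of samples and relabel them as `target`.

    Returns the poisoned dataset and the sorted indices of the poisoned samples.
    """
    rng = random.Random(seed)
    n_poison = int(round(rate * len(data)))
    chosen = set(rng.sample(range(len(data)), n_poison))
    poisoned = [
        (stamp_trigger(x), target) if i in chosen else (x, y)
        for i, (x, y) in enumerate(data)
    ]
    return poisoned, sorted(chosen)


if __name__ == "__main__":
    # Ten blank 8x8 images labeled 0, 1, 2; poison 20% of them toward label 7.
    clean = [([[0.0] * 8 for _ in range(8)], i % 3) for i in range(10)]
    poisoned, idx = poison_dataset(clean, rate=0.2, target=7)
    print("poisoned indices:", idx)
```

Only a small fraction of samples changes and the patch is tiny, so the poisoned set looks almost identical to the clean one, which is why such attacks are hard to spot by eye.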

Importantly, when the model is tested on normal, unpoisoned data, its accuracy is no different from that of a model that wasn't tampered with. This makes backdoor attacks particularly insidious: the model appears to work well, but it can be secretly manipulated to do the attacker's bidding.
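Backdoor evaluations usually capture this with two numbers: clean accuracy on unmodified test data, and attack success rate (the fraction of triggered inputs, excluding those whose true label already is the target, that the model assigns to the target label). The sketch below computes both for two toy models; the function names and the `(value, has_trigger)` input format are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical model interface: any callable mapping an input to a predicted label.
Model = Callable[[Any], int]


def clean_accuracy(model: Model, data: Sequence[Tuple[Any, int]]) -> float:
    """Fraction of unmodified samples that the model labels correctly."""
    correct = sum(1 for x, y in data if model(x) == y)
    return correct / len(data)


def attack_success_rate(
    model: Model, triggered: Sequence[Any], true_labels: Sequence[int], target: int
) -> float:
    """Fraction of triggered inputs (true label != target) that are predicted as `target`."""
    pairs = [(x, y) for x, y in zip(triggered, true_labels) if y != target]
    hits = sum(1 for x, _ in pairs if model(x) == target)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy inputs are (value, has_trigger) pairs; the true label is value % 3.
    def backdoored(x):
        return 7 if x[1] else x[0] % 3  # switches to label 7 when triggered

    def clean_model(x):
        return x[0] % 3  # ignores the trigger entirely

    clean_set = [((i, False), i % 3) for i in range(9)]
    triggered = [(i, True) for i in range(9)]
    labels = [i % 3 for i in range(9)]
    for name, m in [("backdoored", backdoored), ("clean", clean_model)]:
        print(name, clean_accuracy(m, clean_set), attack_success_rate(m, triggered, labels, 7))
```

Both toy models reach a clean accuracy of 1.0, while the attack success rate is 1.0 for the backdoored model and 0.0 for the clean one: clean accuracy alone cannot reveal the backdoor.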

To address this threat, the researchers propose a new training method called PT. The key idea is to freeze part of the model and train the rest, producing a model that can isolate suspicious, potentially poisoned samples. Then, building on this, a clean model is fine-tuned so that it resists backdoor attacks.

The goal is to isolate and neutralize the backdoor without significantly reducing the model's accuracy on normal data. By detecting and mitigating backdoors as part of training, this method aims to make neural networks more secure and reliable.

Technical Explanation

The paper introduces a new model training method, PT (partial training), to mitigate the threat of backdoor attacks on neural networks.

The key insight is that by freezing part of the model during training, it is possible to create a sub-network that is specialized in identifying potentially suspicious or poisoned samples. This sub-network can then be used to isolate such samples, allowing the main model to be fine-tuned in a way that is robust to backdoor attacks.
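Freezing part of a model simply means excluding those parameters from gradient updates. The pure-Python toy below, a hypothetical `PartiallyFrozenModel`, keeps a fixed ReLU layer (the frozen part) and trains only a logistic-regression head with SGD. Which layers PT actually freezes, and how it trains the rest, are not specified in this summary, so this illustrates the freezing mechanism only, not the paper's architecture.

```python
import math
from typing import List, Sequence, Tuple


def softplus(t: float) -> float:
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


class PartiallyFrozenModel:
    """Toy binary classifier: a frozen ReLU layer (the 'backbone') plus a trainable logistic head."""

    def __init__(self, backbone_weights: List[List[float]]):
        # Frozen part: stored but never updated by train_head().
        self.W1 = [row[:] for row in backbone_weights]
        # Trainable part: one weight per backbone feature, plus a bias.
        self.w2 = [0.0] * len(self.W1)
        self.b = 0.0

    def features(self, x: Sequence[float]) -> List[float]:
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.W1]

    def logit(self, x: Sequence[float]) -> float:
        h = self.features(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b

    def prob(self, x: Sequence[float]) -> float:
        # sigmoid(z) = exp(-softplus(-z)), which avoids overflow for large |z|.
        return math.exp(-softplus(-self.logit(x)))

    def mean_loss(self, data: Sequence[Tuple[Sequence[float], int]]) -> float:
        # Binary cross-entropy: softplus(-z) if y == 1, softplus(z) if y == 0.
        total = sum(
            softplus(-self.logit(x)) if y == 1 else softplus(self.logit(x))
            for x, y in data
        )
        return total / len(data)

    def train_head(self, data, epochs: int = 100, lr: float = 0.1) -> None:
        """Plain SGD on the head only; W1 receives no updates."""
        for _ in range(epochs):
            for x, y in data:
                h = self.features(x)  # output of the frozen backbone
                z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b
                g = math.exp(-softplus(-z)) - y  # dLoss/dz = sigmoid(z) - y
                self.w2 = [w - lr * g * hi for w, hi in zip(self.w2, h)]
                self.b -= lr * g


if __name__ == "__main__":
    # Fixed backbone: ReLU of +x and -x for each input coordinate.
    W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    data = [([1.0, 0.5], 1), ([2.0, -0.3], 1), ([-1.0, 0.2], 0), ([-2.0, -0.4], 0)]
    model = PartiallyFrozenModel(W1)
    print("loss before:", model.mean_loss(data))
    model.train_head(data)
    print("loss after:", model.mean_loss(data))
```

In a deep-learning framework the same effect is usually obtained by setting `requires_grad = False` on the frozen parameters (as in PyTorch) or by leaving them out of the optimizer.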

Specifically, the PT method involves two main steps:

Partial training and isolation: The researchers freeze part of the network, such as its initial layers, so that it acts as a fixed feature extractor, and train only the remaining layers on the possibly poisoned training data. The resulting partially trained model is then used to separate normal samples from potentially poisoned ones.
Fine-tuning a clean model: Building on this, a clean model is fine-tuned to perform the target task accurately on normal data while resisting the effects of the suspicious samples isolated in the first step.
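The summary does not say how the partially trained model decides which samples are suspicious. One heuristic used elsewhere in the backdoor-defense literature (for example, Anti-Backdoor Learning) is that poisoned samples tend to be fit faster and therefore end up with unusually low training loss. The hypothetical `isolate_suspicious` helper below applies that heuristic, flagging the lowest-loss fraction of samples and keeping the rest for fine-tuning the clean model; it is an assumption, not the paper's confirmed rule.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def isolate_suspicious(
    samples: Sequence[T], losses: Sequence[float], isolation_rate: float
) -> Tuple[List[T], List[T]]:
    """Split samples into (retained, suspicious) using a low-loss heuristic.

    Assumption (not confirmed for PT): poisoned samples are fit unusually fast,
    so the `isolation_rate` fraction with the lowest loss is flagged as suspicious.
    """
    if len(samples) != len(losses):
        raise ValueError("need exactly one loss per sample")
    if not 0.0 <= isolation_rate <= 1.0:
        raise ValueError("isolation_rate must be in [0, 1]")
    n_flag = int(isolation_rate * len(samples))
    by_loss = sorted(range(len(samples)), key=lambda i: losses[i])
    flagged = set(by_loss[:n_flag])
    # Preserve the original order within each group.
    retained = [s for i, s in enumerate(samples) if i not in flagged]
    suspicious = [s for i, s in enumerate(samples) if i in flagged]
    return retained, suspicious


if __name__ == "__main__":
    # Per-sample losses would come from the partially trained model;
    # these values are purely illustrative.
    samples = ["a", "b", "c", "d", "e"]
    losses = [0.9, 0.01, 0.7, 0.02, 0.5]
    kept, flagged = isolate_suspicious(samples, losses, isolation_rate=0.4)
    print("fine-tune on:", kept, "| isolated:", flagged)
```

The retained list would then serve as the training set for the fine-tuning stage, while the flagged samples could be discarded or handled separately.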

By taking this two-stage approach, the PT method aims to maintain the model's performance on clean data while isolating and neutralizing the backdoor, a goal that becomes more pressing as backdoor attacks grow more sophisticated.

Critical Analysis

The PT method proposed in this paper is a promising approach to mitigating the threat of backdoor attacks on neural networks. By partially training the model so that it can flag potentially poisoned samples, the researchers aim to isolate and neutralize the backdoor without significantly impacting the model's normal performance.

However, the paper does not address the question of how to ensure that the backbone network itself is not vulnerable to backdoor attacks. If the initial feature extractor is compromised, the entire defense mechanism could be undermined. Further research is needed to explore ways of making the backbone more resilient to such attacks.

Additionally, the paper focuses on a specific type of backdoor attack where the poisoned samples are visually similar to normal data. It's unclear how well the PT method would perform against more advanced backdoor attacks that target the model's semantic understanding rather than just its visual perception.

Overall, the PT method represents a step forward in the ongoing effort to secure neural networks against backdoor attacks. By building the defense into the training process itself, this research could inspire further advancements in this critical area of machine learning.

Conclusion

This paper introduces a new model training method, PT (partial training), that aims to mitigate the threat of backdoor attacks on neural networks. By freezing part of the model and training the rest to isolate potentially poisoned samples, and then fine-tuning a clean model, PT aims to neutralize the backdoor while maintaining the model's performance on clean data.

While the proposed approach shows promise, further research is needed to address the potential vulnerabilities of the backbone network and explore its effectiveness against more advanced backdoor attack techniques. Nevertheless, the PT method represents an important step forward in the ongoing effort to make neural networks more secure and reliable in the face of these insidious threats.

As machine learning models become increasingly ubiquitous in real-world applications, the need for robust defenses against backdoor attacks will only grow. The insights and techniques presented in this paper could serve as a valuable foundation for future work in this critical area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
