Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks

Read original: arXiv:2212.10717 - Published 8/2/2024 by Jimmy Z. Di, Jack Douglas, Jayadev Acharya, Gautam Kamath, Ayush Sekhari

Overview

Introduces a new type of attack called "camouflaged data poisoning" in the context of machine unlearning
Adversary adds carefully crafted data points to the training dataset that have minimal impact on model predictions
Adversary then triggers a request to remove a subset of the introduced points, causing the model's predictions to be negatively affected
Experiments on CIFAR-10, Imagenette, and Imagewoof datasets show this "clean-label targeted attack" can cause models to misclassify specific test points

Plain English Explanation

In this paper, the researchers introduce a new type of attack called "camouflaged data poisoning" that can be used to trick machine learning models. The idea is that an adversary first adds a few carefully designed data points to the training dataset. These points are crafted so that they have minimal impact on the model's predictions during normal use.

However, the adversary then triggers a request to remove a subset of the introduced points. At this point, the attack is unleashed, and the model's predictions become negatively affected. The researchers show that this "clean-label targeted attack" can cause models trained on datasets like CIFAR-10, Imagenette, and Imagewoof to misclassify specific test points.

The key insight here is that the adversary is able to "camouflage" the impact of the poisoned data, making it hard to detect. This attack vector arises in the context of machine unlearning and other settings where retraining the model may be required.

Technical Explanation

The researchers propose a new attack called "camouflaged data poisoning" that targets machine learning models in the context of machine unlearning. The attack works as follows:

First, the adversary adds a small number of carefully crafted data points to the training dataset: poison points aimed at a specific test point, together with "camouflage" points that mask the poison's effect. While both sets are present, the impact on the model's predictions is minimal.
Then, the adversary triggers a request to remove a subset of the introduced points (the camouflage points). Once the model is updated without them, the poison is no longer masked, the attack is unleashed, and the model's predictions become negatively affected.
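
The two phases can be illustrated with a deliberately tiny sketch. It is not the paper's construction: it assumes a 1-nearest-neighbour classifier on one-dimensional data, models unlearning as exact retraining without the removed points, and uses hypothetical values for every data point. The function names (train, predict, unlearn) are likewise made up for illustration. Each added point carries the label of its own side of the true boundary, which stands in for the clean-label property.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (1-D feature, class label)


def train(dataset: List[Point]) -> List[Point]:
    """'Training' a 1-nearest-neighbour model just stores a copy of the data."""
    return list(dataset)


def predict(model: List[Point], x: float) -> int:
    """Return the label of the stored point closest to x."""
    nearest = min(model, key=lambda p: abs(p[0] - x))
    return nearest[1]


def unlearn(dataset: List[Point], to_remove: List[Point]) -> List[Point]:
    """Exact unlearning (assumed): drop the requested points, retrain from scratch."""
    remaining = [p for p in dataset if p not in to_remove]
    return train(remaining)


# Toy setup, all values hypothetical: the true boundary is x = 0,
# with class 0 to the left and class 1 to the right.
clean = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
target_x, target_label = 0.4, 1  # the test point the adversary wants misclassified

poison = [(-0.1, 0)]      # correctly labelled, but pulls the target towards class 0
camouflage = [(0.6, 1)]   # correctly labelled, sits closer to the target and masks the poison

# Baseline: a model trained on clean data classifies the target correctly.
pred_clean = predict(train(clean), target_x)                      # 1

# Poison alone would already flip the prediction, which could be noticed.
pred_poison_only = predict(train(clean + poison), target_x)       # 0

# Phase 1: poison and camouflage are added together; behaviour looks normal.
deployed_data = clean + poison + camouflage
pred_camouflaged = predict(train(deployed_data), target_x)        # 1

# Phase 2: the adversary requests removal of the camouflage points only.
pred_after_unlearning = predict(unlearn(deployed_data, camouflage), target_x)  # 0

print(pred_clean, pred_poison_only, pred_camouflaged, pred_after_unlearning)
```

In this toy, the target is classified correctly both before the attack and while the camouflage is in place. It is misclassified only after the removal request is honoured, which is the behaviour the attack relies on. Attacking a neural network requires optimised perturbations rather than hand-placed points, so the sketch conveys only the sequencing of the attack.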

The researchers evaluate this "clean-label targeted attack" on several image classification datasets, including CIFAR-10, Imagenette, and Imagewoof. They demonstrate that the attack can cause the model to misclassify specific test points, even though the poisoned data points themselves appear benign.

The key technical insight is that the adversary is able to "camouflage" the impact of the poisoned data, making it hard to detect. This is achieved by carefully crafting the data points to have minimal impact on the model's predictions during normal use, while still triggering the desired behavior when a subset of the points is removed.
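
One way to state this requirement compactly is given below. The notation is an assumption introduced here for illustration and is not taken from the summary: $S_{\mathrm{cl}}$ is the clean training set, $S_{\mathrm{po}}$ the poison points, $S_{\mathrm{ca}}$ the camouflage points, $A$ the training procedure, and $(x_t, y_t)$ the targeted test point with its correct label.

```latex
% While poison and camouflage are both present, the target is classified correctly:
A\left(S_{\mathrm{cl}} \cup S_{\mathrm{po}} \cup S_{\mathrm{ca}}\right)(x_t) = y_t
% After the camouflage points are unlearned, the target is misclassified:
A\left(S_{\mathrm{cl}} \cup S_{\mathrm{po}}\right)(x_t) \neq y_t
```

The second condition treats unlearning as equivalent to retraining on the remaining data, which is a simplifying assumption. The first condition is what keeps the poison hidden until the removal request is processed.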

Critical Analysis

The paper presents a novel and potentially impactful attack vector that arises in the context of machine unlearning and other settings where model retraining may be required. The researchers demonstrate the attack's effectiveness on several image classification datasets, which is concerning given the widespread use of such models in real-world applications.

However, the paper does not explore the attack's feasibility or potential impact in more practical, real-world settings. The researchers use relatively simple datasets and attack scenarios, and it's unclear how the attack would scale or perform against more robust and complex models.

Additionally, the paper does not provide in-depth discussion of potential countermeasures or defenses against this type of attack. While the authors mention the need for further research in this area, more guidance on how to detect and mitigate this threat would be valuable for the community.

Overall, the paper presents an interesting and important new attack vector that warrants further investigation and consideration by the machine learning research community and practitioners.

Conclusion

This paper introduces a novel type of attack called "camouflaged data poisoning" that targets machine learning models in settings where machine unlearning or other retraining can be induced. The key insight is that an adversary can craft poisoned data points that have minimal impact on the model's predictions during normal use, yet cause a targeted misclassification once a subset of the introduced points is removed.

While the paper demonstrates the attack's effectiveness on several image classification datasets, more research is needed to understand its feasibility and potential impact in real-world settings. Exploring robust countermeasures and defenses against this type of attack should also be a priority for the machine learning community going forward.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks

Jimmy Z. Di, Jack Douglas, Jayadev Acharya, Gautam Kamath, Ayush Sekhari

We introduce camouflaged data poisoning attacks, a new attack vector that arises in the context of machine unlearning and other settings when model retraining may be induced. An adversary first adds a few carefully crafted points to the training dataset such that the impact on the model's predictions is minimal. The adversary subsequently triggers a request to remove a subset of the introduced points at which point the attack is unleashed and the model's predictions are negatively affected. In particular, we consider clean-label targeted attacks (in which the goal is to cause the model to misclassify a specific test point) on datasets including CIFAR-10, Imagenette, and Imagewoof. This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.

8/2/2024

Potion: Towards Poison Unlearning

Stefan Schoepf, Jack Foster, Alexandra Brintrup

Adversarial attacks by malicious actors on machine learning systems, such as introducing poison triggers into training datasets, pose significant risks. The challenge in resolving such an attack arises in practice when only a subset of the poisoned data can be identified. This necessitates the development of methods to remove, i.e. unlearn, poison triggers from already trained models with only a subset of the poison data available. The requirements for this task significantly deviate from privacy-focused unlearning where all of the data to be forgotten by the model is known. Previous work has shown that the undiscovered poisoned samples lead to a failure of established unlearning methods, with only one method, Selective Synaptic Dampening (SSD), showing limited success. Even full retraining, after the removal of the identified poison, cannot address this challenge as the undiscovered poison samples lead to a reintroduction of the poison trigger in the model. Our work addresses two key challenges to advance the state of the art in poison unlearning. First, we introduce a novel outlier-resistant method, based on SSD, that significantly improves model protection and unlearning performance. Second, we introduce Poison Trigger Neutralisation (PTN) search, a fast, parallelisable, hyperparameter search that utilises the characteristic unlearning versus model protection trade-off to find suitable hyperparameters in settings where the forget set size is unknown and the retain set is contaminated. We benchmark our contributions using ResNet-9 on CIFAR10 and WideResNet-28x10 on CIFAR100. Experimental results show that our method heals 93.72% of poison compared to SSD with 83.41% and full retraining with 40.68%. We achieve this while also lowering the average model accuracy drop caused by unlearning from 5.68% (SSD) to 1.41% (ours).

9/12/2024

Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks

Quang H. Nguyen, Nguyen Ngoc-Hieu, The-Anh Ta, Thanh Nguyen-Tang, Kok-Seng Wong, Hoang Thanh-Tung, Khoa D. Doan

Deep neural networks are vulnerable to backdoor attacks, a type of adversarial attack that poisons the training data to manipulate the behavior of models trained on such data. Clean-label attacks are a more stealthy form of backdoor attacks that can perform the attack without changing the labels of poisoned data. Early works on clean-label attacks added triggers to a random subset of the training set, ignoring the fact that samples contribute unequally to the attack's success. This results in high poisoning rates and low attack success rates. To alleviate the problem, several supervised learning-based sample selection strategies have been proposed. However, these methods assume access to the entire labeled training set and require training, which is expensive and may not always be practical. This work studies a new and more practical (but also more challenging) threat model where the attacker only provides data for the target class (e.g., in face recognition systems) and has no knowledge of the victim model or any other classes in the training set. We study different strategies for selectively poisoning a small set of training samples in the target class to boost the attack success rate in this setting. Our threat model poses a serious threat in training machine learning models with third-party datasets, since the attack can be performed effectively with limited information. Experiments on benchmark datasets illustrate the effectiveness of our strategies in improving clean-label backdoor attacks.

7/17/2024

Transferable Availability Poisoning Attacks

Yiyong Liu, Michael Backes, Xiao Zhang

We consider availability data poisoning attacks, where an adversary aims to degrade the overall test accuracy of a machine learning model by crafting small perturbations to its training data. Existing poisoning strategies can achieve the attack goal but assume the victim to employ the same learning method as what the adversary uses to mount the attack. In this paper, we argue that this assumption is strong, since the victim may choose any learning algorithm to train the model as long as it can achieve some targeted performance on clean data. Empirically, we observe a large decrease in the effectiveness of prior poisoning attacks if the victim employs an alternative learning algorithm. To enhance the attack transferability, we propose Transferable Poisoning, which first leverages the intrinsic characteristics of alignment and uniformity to enable better unlearnability within contrastive learning, and then iteratively utilizes the gradient information from supervised and unsupervised contrastive learning paradigms to generate the poisoning perturbations. Through extensive experiments on image benchmarks, we show that our transferable poisoning attack can produce poisoned samples with significantly improved transferability, not only applicable to the two learners used to devise the attack but also to learning algorithms and even paradigms beyond.

6/7/2024