Naive Bayes Classifiers over Missing Data: Decision and Poisoning

Read original: arXiv:2303.04811 - Published 5/29/2024 by Song Bian, Xiating Ouyang, Zhiwei Fan, Paraschos Koutris

➖

Overview

This paper studies the certifiable robustness of machine learning (ML) classifiers on datasets that could contain missing values.
A test point is certifiably robust for an ML classifier if the classifier gives the same prediction for that test point, regardless of which cleaned version of the dirty dataset the classifier is trained on.
The paper shows that for Naive Bayes Classifiers (NBCs) over dirty datasets with missing values:
1. There exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset.
2. The data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells into the clean dataset, is polynomial time for single test points but NP-complete for multiple test points.
Experiments demonstrate that the algorithms are efficient and outperform existing baselines.

Plain English Explanation

This research focuses on the robustness of machine learning (ML) classifiers, specifically Naive Bayes Classifiers (NBCs), when dealing with datasets that contain missing values. The key idea is that a test point (an input to the ML model) is considered "certifiably robust" if the classifier always gives the same prediction for that test point, no matter which cleaned version of the original "dirty" dataset (the one with missing values) the classifier is trained on.

The researchers show that for NBCs and datasets with missing values, there is an efficient algorithm that can quickly determine whether multiple test points are all certifiably robust. They also demonstrate that while it's possible to intentionally make a single test point non-robust by inserting missing values into the dataset (a "data poisoning attack"), it becomes much more difficult to do this for multiple test points at the same time - in fact, it's an NP-complete problem, which means it's computationally very hard to solve.

The significance of this work is that it provides a way to understand the reliability and consistency of ML classifiers, even when the data they're trained on is imperfect or incomplete. This could be useful in various applications where robustness is important, such as certified robustness against sparse adversarial perturbations or balancing accuracy and robustness in ML models.

Technical Explanation

The key technical contributions of this paper are:

Efficient Algorithm for Certifiable Robustness: The researchers prove that for Naive Bayes Classifiers (NBCs) over dirty datasets with missing values, there exists an efficient polynomial-time algorithm to decide whether multiple input test points are all certifiably robust. This means that the algorithm can quickly determine if the classifier will give the same prediction for a set of test points, regardless of which cleaned version of the dirty dataset the classifier is trained on.
Complexity of Data Poisoning Attacks: The paper also analyzes the computational complexity of data poisoning attacks, which aim to make all input test points certifiably non-robust by inserting missing cells into the clean dataset. The researchers show that while this attack is polynomial-time for single test points, it becomes NP-complete (computationally very hard) for multiple test points.

The experiments in the paper demonstrate that the proposed algorithms are efficient and outperform existing baselines for certifying the robustness of NBCs on dirty datasets with missing values. This work builds on previous research in Byzantine-robust optimization against data poisoning and certified robustness against sparse adversarial perturbations.

Critical Analysis

The paper provides a strong theoretical foundation for understanding the certifiable robustness of Naive Bayes Classifiers (NBCs) on dirty datasets with missing values. The efficient algorithm for certifying robustness and the complexity analysis of data poisoning attacks are valuable contributions to the field.

However, the paper focuses solely on NBCs, and it would be interesting to see if the results extend to other types of ML classifiers. Additionally, the paper does not address the practical implications of certifiable robustness or how it might be applied in real-world scenarios.

Furthermore, the paper does not discuss the potential limitations of the proposed approach, such as the assumptions made about the data distribution or the specific types of missing values considered. It would be helpful to understand the boundaries of the approach and the conditions under which it might not be applicable.

Overall, the research provides a solid foundation for understanding the robustness of ML classifiers on imperfect data, and the findings could be valuable for researchers and practitioners working on balancing accuracy and robustness in ML models or [defending against poisoning-based backdoor attacks.

Conclusion

This paper presents a detailed analysis of the certifiable robustness of Naive Bayes Classifiers (NBCs) on dirty datasets with missing values. The key findings include an efficient algorithm for certifying the robustness of multiple test points and the computational complexity of data poisoning attacks.

The research demonstrates the importance of understanding the reliability and consistency of ML classifiers, even when the training data is imperfect. The results could have implications for various applications where robustness is critical, such as certified robustness against sparse adversarial perturbations or balancing accuracy and robustness in ML models.

While the paper focuses on NBCs, further research could explore the extensibility of the approach to other types of ML classifiers. Additionally, investigating the practical applications and potential limitations of certifiable robustness on dirty datasets could be valuable avenues for future work.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

➖

Naive Bayes Classifiers over Missing Data: Decision and Poisoning

Song Bian, Xiating Ouyang, Zhiwei Fan, Paraschos Koutris

We study the certifiable robustness of ML classifiers on dirty datasets that could contain missing values. A test point is certifiably robust for an ML classifier if the classifier returns the same prediction for that test point, regardless of which cleaned version (among exponentially many) of the dirty dataset the classifier is trained on. In this paper, we show theoretically that for Naive Bayes Classifiers (NBC) over dirty datasets with missing values: (i) there exists an efficient polynomial time algorithm to decide whether multiple input test points are all certifiably robust over a dirty dataset; and (ii) the data poisoning attack, which aims to make all input test points certifiably non-robust by inserting missing cells to the clean dataset, is in polynomial time for single test points but NP-complete for multiple test points. Extensive experiments demonstrate that our algorithms are efficient and outperform existing baselines.

5/29/2024

Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks

Lukas Gosch, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Stephan Gunnemann

Generalization of machine learning models can be severely compromised by data poisoning, where adversarial changes are applied to the training data, as well as backdoor attacks that additionally manipulate the test data. These vulnerabilities have led to interest in certifying (i.e., proving) that such changes up to a certain magnitude do not affect test predictions. We, for the first time, certify Graph Neural Networks (GNNs) against poisoning and backdoor attacks targeting the node features of a given graph. Our certificates are white-box and based upon $(i)$ the neural tangent kernel, which characterizes the training dynamics of sufficiently wide networks; and $(ii)$ a novel reformulation of the bilevel optimization problem describing poisoning as a mixed-integer linear program. Consequently, we leverage our framework to provide fundamental insights into the role of graph structure and its connectivity on the worst-case robustness behavior of convolution-based and PageRank-based GNNs. We note that our framework is more general and constitutes the first approach to derive white-box poisoning certificates for NNs, which can be of independent interest beyond graph-related tasks.

7/16/2024

Certified Robustness to Data Poisoning in Gradient-Based Training

Philip Sosnin, Mark N. Muller, Maximilian Baader, Calvin Tsay, Matthew Wicker

Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. However, provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge and develop the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data. In particular, our framework certifies robustness against untargeted and targeted poisoning as well as backdoor attacks for both input and label manipulations. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.

6/11/2024

🛠️

On the Relevance of Byzantine Robust Optimization Against Data Poisoning

Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Rafael Pinot

The success of machine learning (ML) has been intimately linked with the availability of large amounts of data, typically collected from heterogeneous sources and processed on vast networks of computing devices (also called {em workers}). Beyond accuracy, the use of ML in critical domains such as healthcare and autonomous driving calls for robustness against {em data poisoning}and some {em faulty workers}. The problem of {em Byzantine ML} formalizes these robustness issues by considering a distributed ML environment in which workers (storing a portion of the global dataset) can deviate arbitrarily from the prescribed algorithm. Although the problem has attracted a lot of attention from a theoretical point of view, its practical importance for addressing realistic faults (where the behavior of any worker is locally constrained) remains unclear. It has been argued that the seemingly weaker threat model where only workers' local datasets get poisoned is more reasonable. We prove that, while tolerating a wider range of faulty behaviors, Byzantine ML yields solutions that are, in a precise sense, optimal even under the weaker data poisoning threat model. Then, we study a generic data poisoning model wherein some workers have {em fully-poisonous local data}, i.e., their datasets are entirely corruptible, and the remainders have {em partially-poisonous local data}, i.e., only a fraction of their local datasets is corruptible. We prove that Byzantine-robust schemes yield optimal solutions against both these forms of data poisoning, and that the former is more harmful when workers have {em heterogeneous} local data.

5/2/2024