S-BDT: Distributed Differentially Private Boosted Decision Trees

Read original: arXiv:2309.12041 - Published 8/19/2024 by Thorsten Peinemann, Moritz Kirschte, Joshua Stock, Carlos Cotrini, Esfandiar Mohammadi

🎯

Overview

Introduces S-BDT, a novel differentially private distributed gradient boosted decision tree (GBDT) learner
Aims to improve privacy protection of individual training data points while achieving meaningful learning goals
Uses less noise by relying on non-spherical multivariate Gaussian noise
Provides tight subsampling bounds for privacy amplification and incorporates that into a Rényi filter for individual privacy accounting
Experimentally demonstrates savings in privacy budget (epsilon) while maintaining similar utility on regression and classification datasets

Plain English Explanation

S-BDT is a new machine learning technique that combines differential privacy with gradient boosted decision trees. Differential privacy helps protect the privacy of individual data points used to train the model, while gradient boosting is a powerful technique for building accurate predictive models.

The key innovation in S-BDT is that it uses a special type of random noise, called non-spherical multivariate Gaussian noise, to add just the right amount of privacy protection without sacrificing too much model accuracy. The researchers also developed new mathematical techniques to understand how this noise affects the overall privacy guarantees.

Through experiments on several datasets, the researchers show that S-BDT can achieve the same level of model performance as standard gradient boosting, but with a 30-50% reduction in the amount of privacy budget (epsilon) required. This means they can train highly accurate models while providing stronger privacy protections for the individuals whose data was used.

The researchers also found that S-BDT performs even better when the training data comes from different subpopulations (i.e., is not independent and identically distributed). This suggests S-BDT could be particularly useful in real-world scenarios where data comes from diverse sources.

Technical Explanation

S-BDT is a differentially private distributed gradient boosted decision tree (GBDT) learner. It aims to improve the privacy protection of individual training data points while still achieving meaningful learning goals like high accuracy or low regression error.

The key technical innovations in S-BDT are:

Non-Spherical Multivariate Gaussian Noise: S-BDT uses a special type of random noise that is non-spherical and multivariate Gaussian. This allows it to add less noise compared to standard techniques while still providing strong differential privacy guarantees.
Tight Subsampling Bounds: The researchers derived tight mathematical bounds on how the privacy guarantees are amplified when subsampling the training data. This is incorporated into a Rényi differential privacy accounting mechanism for individual privacy tracking.

Experimentally, S-BDT was able to match the utility (accuracy or regression performance) of standard GBDT models, while saving 50% of the privacy budget (epsilon) for the Abalone regression dataset, 30% for the Adult classification dataset, and 30% for the Spambase classification dataset. The savings were even greater when the training data came from different subpopulations (non-IID).

Critical Analysis

The paper provides a thorough technical explanation of the S-BDT algorithm and its privacy guarantees. The experimental results demonstrate significant improvements in privacy-utility tradeoffs compared to standard GBDT models.

One potential limitation is that the experiments were conducted on relatively small to medium-sized datasets. It would be helpful to see how S-BDT scales to larger, more complex real-world datasets. Additionally, the paper does not discuss the computational overhead or training time of S-BDT compared to standard GBDT.

The paper also does not explore the robustness of S-BDT to adversarial attacks or other potential threats to differential privacy. Further research could investigate the security properties of S-BDT in more depth.

Overall, S-BDT appears to be a promising approach for building accurate machine learning models with strong individual privacy protections. The technical innovations and experimental results suggest it could be a valuable tool for applications where both utility and privacy are paramount.

Conclusion

S-BDT introduces a novel differentially private distributed gradient boosted decision tree (GBDT) learner that can achieve the same level of utility as standard GBDT models while using 30-50% less privacy budget (epsilon). This is enabled by the use of non-spherical multivariate Gaussian noise and tight subsampling bounds for privacy accounting.

The experimental results on regression and classification datasets demonstrate the effectiveness of S-BDT, especially when the training data comes from diverse subpopulations. This suggests S-BDT could be particularly useful in real-world scenarios where data privacy and model performance are both crucial.

While the paper provides a strong technical foundation, further research is needed to explore the scalability, robustness, and security properties of S-BDT. Nonetheless, this work represents an important advance in the field of differentially private machine learning and could have significant implications for a wide range of applications where both utility and privacy are essential.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🎯

S-BDT: Distributed Differentially Private Boosted Decision Trees

Thorsten Peinemann, Moritz Kirschte, Joshua Stock, Carlos Cotrini, Esfandiar Mohammadi

We introduce S-BDT: a novel $(varepsilon,delta)$-differentially private distributed gradient boosted decision tree (GBDT) learner that improves the protection of single training data points (privacy) while achieving meaningful learning goals, such as accuracy or regression error (utility). S-BDT uses less noise by relying on non-spherical multivariate Gaussian noise, for which we show tight subsampling bounds for privacy amplification and incorporate that into a R'enyi filter for individual privacy accounting. We experimentally reach the same utility while saving $50%$ in terms of epsilon for $varepsilon le 0.5$ on the Abalone regression dataset (dataset size $approx 4K$), saving $30%$ in terms of epsilon for $varepsilon le 0.08$ for the Adult classification dataset (dataset size $approx 50K$), and saving $30%$ in terms of epsilon for $varepsilonleq0.03$ for the Spambase classification dataset (dataset size $approx 5K$). Moreover, we show that for situations where a GBDT is learning a stream of data that originates from different subpopulations (non-IID), S-BDT improves the saving of epsilon even further.

8/19/2024

Diffusion Boosted Trees

Xizewen Han, Mingyuan Zhou

Combining the merits of both denoising diffusion probabilistic models and gradient boosting, the diffusion boosting paradigm is introduced for tackling supervised learning problems. We develop Diffusion Boosted Trees (DBT), which can be viewed as both a new denoising diffusion generative model parameterized by decision trees (one single tree for each diffusion timestep), and a new boosting algorithm that combines the weak learners into a strong learner of conditional distributions without making explicit parametric assumptions on their density forms. We demonstrate through experiments the advantages of DBT over deep neural network-based diffusion models as well as the competence of DBT on real-world regression tasks, and present a business application (fraud detection) of DBT for classification on tabular data with the ability of learning to defer.

6/5/2024

🚀

SecureBoost+ : A High Performance Gradient Boosting Tree Framework for Large Scale Vertical Federated Learning

Tao Fan, Weijing Chen, Guoqiang Ma, Yan Kang, Lixin Fan, Qiang Yang

Gradient boosting decision tree (GBDT) is an ensemble machine learning algorithm, which is widely used in industry, due to its good performance and easy interpretation. Due to the problem of data isolation and the requirement of privacy, many works try to use vertical federated learning to train machine learning models collaboratively with privacy guarantees between different data owners. SecureBoost is one of the most popular vertical federated learning algorithms for GBDT. However, in order to achieve privacy preservation, SecureBoost involves complex training procedures and time-consuming cryptography operations. This causes SecureBoost to be slow to train and does not scale to large scale data. In this work, we propose SecureBoost+, a large-scale and high-performance vertical federated gradient boosting decision tree framework. SecureBoost+ is secure in the semi-honest model, which is the same as SecureBoost. SecureBoost+ can be scaled up to tens of millions of data samples easily. SecureBoost+ achieves high performance through several novel optimizations for SecureBoost, including ciphertext operation optimization, the introduction of new training mechanisms, and multi-classification training optimization. The experimental results show that SecureBoost+ is 6-35x faster than SecureBoost, but with the same accuracy and can be scaled up to tens of millions of data samples and thousands of feature dimensions.

6/21/2024

PrivSGP-VR: Differentially Private Variance-Reduced Stochastic Gradient Push with Tight Utility Bounds

Zehan Zhu, Yan Huang, Xin Wang, Jinming Xu

In this paper, we propose a differentially private decentralized learning method (termed PrivSGP-VR) which employs stochastic gradient push with variance reduction and guarantees $(epsilon, delta)$-differential privacy (DP) for each node. Our theoretical analysis shows that, under DP Gaussian noise with constant variance, PrivSGP-VR achieves a sub-linear convergence rate of $mathcal{O}(1/sqrt{nK})$, where $n$ and $K$ are the number of nodes and iterations, respectively, which is independent of stochastic gradient variance, and achieves a linear speedup with respect to $n$. Leveraging the moments accountant method, we further derive an optimal $K$ to maximize the model utility under certain privacy budget in decentralized settings. With this optimized $K$, PrivSGP-VR achieves a tight utility bound of $mathcal{O}left( sqrt{dlog left( frac{1}{delta} right)}/(sqrt{n}Jepsilon) right)$, where $J$ and $d$ are the number of local samples and the dimension of decision variable, respectively, which matches that of the server-client distributed counterparts, and exhibits an extra factor of $1/sqrt{n}$ improvement compared to that of the existing decentralized counterparts, such as A(DP)$^2$SGD. Extensive experiments corroborate our theoretical findings, especially in terms of the maximized utility with optimized $K$, in fully decentralized settings.

5/7/2024