Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference

arXiv:2405.04043

Published 5/8/2024 by Conor Hassan, Matthew Sutton, Antonietta Mira, Kerrie Mengersen

Abstract

Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains.

Overview

This paper introduces a novel framework for fitting Bayesian models in the context of Vertical Federated Learning (VFL), where multiple clients hold different sets of covariates.
The authors propose a data augmentation technique to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms.
They present a model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods.
To address the dimensionality challenge posed by data augmentation, the authors develop a factorized amortized variational approximation that achieves scalability independent of the number of observations.
The framework is evaluated on logistic regression, multilevel regression, and a hierarchical Bayesian split neural net model, demonstrating its effectiveness.
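
The structure described in the points above can be written down roughly as follows. This is only an illustrative sketch: the symbols (f_j for client j's contribution, z_{ij} for the auxiliary variables, the Gaussian coupling and its tolerance \rho) are assumptions chosen for exposition, not notation or a construction confirmed by the paper.

```latex
% Illustrative notation only (assumed, not taken from the paper):
% i = 1..N observations, j = 1..J clients,
% x_{ij} = covariates that client j holds for observation i.
\begin{align}
  % Typical VFL coupling: each response depends on a sum of client contributions,
  % so the likelihood does not split into one factor per client.
  p(y \mid \theta)
    &= \prod_{i=1}^{N} p\Big(y_i \,\Big|\, \sum_{j=1}^{J} f_j(x_{ij}; \theta_j)\Big), \\
  % One possible augmentation (assumed Gaussian coupling with tolerance \rho):
  % the auxiliary variable z_{ij} stands in for client j's contribution.
  p(y, z \mid \theta)
    &= \prod_{i=1}^{N} \bigg[ p\Big(y_i \,\Big|\, \sum_{j=1}^{J} z_{ij}\Big)
       \prod_{j=1}^{J} \mathcal{N}\big(z_{ij} \mid f_j(x_{ij}; \theta_j), \rho^2\big) \bigg].
\end{align}
```

Under this sketch there is one auxiliary variable per observation per client, N x J in total, which is the dimensionality growth the amortized approximation is meant to control.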

Plain English Explanation

In Vertical Federated Learning (VFL), multiple organizations or clients collaborate to train a machine learning model without sharing their private data. Each client holds a different set of information, like customer demographics or purchase history. The paper presents a new approach to build Bayesian models in this VFL setting.

The key idea is to use a technique called "data augmentation" to transform the VFL problem into a form that can work with existing Bayesian federated learning algorithms. This allows the clients to collaborate on building a Bayesian model without revealing their private data.

The authors also present a model formulation for cases where the overall likelihood function breaks down into separate parts for each client. Data augmentation brings a challenge of its own: it adds extra variables whose number grows with both the number of observations and the number of clients. To handle this, the authors develop a factorized amortized variational approximation that scales independently of the number of observations.

The paper demonstrates the effectiveness of this framework through experiments on different types of models, like logistic regression and neural networks. This paves the way for private, decentralized Bayesian inference in real-world scenarios where data is vertically partitioned across multiple organizations.

Technical Explanation

The paper introduces a comprehensive framework for fitting Bayesian models in the Vertical Federated Learning (VFL) setting, where multiple clients hold distinct sets of covariates. The authors propose a novel data augmentation approach to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms.
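
To make the vertically partitioned setting concrete, here is a minimal NumPy sketch of logistic regression in which each client holds a block of columns. It is not the paper's algorithm: the client sizes, the function names `local_contribution` and `bernoulli_loglik`, and the choice to place the responses on the server are all assumptions made for illustration. It only shows that the likelihood depends on the clients through the sum of their local contributions, which is the coupling that data augmentation is meant to break.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs = 6                  # shared observations (rows), assumed already aligned across clients
client_dims = [2, 3, 1]    # hypothetical number of covariates held by each client

# Each client j privately holds its own block of columns X_j and coefficients beta_j.
X_blocks = [rng.normal(size=(n_obs, d)) for d in client_dims]
beta_blocks = [rng.normal(size=d) for d in client_dims]
y = rng.integers(0, 2, size=n_obs)   # binary responses, held by the server in this sketch


def local_contribution(X_j, beta_j):
    """Computed on client j: its additive share of the linear predictor."""
    return X_j @ beta_j


def bernoulli_loglik(y, eta):
    """Logistic-regression log-likelihood given the full linear predictor eta."""
    # log p(y | eta) = y * eta - log(1 + exp(eta)), evaluated stably via logaddexp.
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


# Server side: only per-client contributions are combined, never the raw covariates.
eta_federated = sum(
    local_contribution(X_j, b_j) for X_j, b_j in zip(X_blocks, beta_blocks)
)

# Reference: the same quantity computed as if all columns were pooled in one place.
eta_pooled = np.hstack(X_blocks) @ np.concatenate(beta_blocks)

loglik_federated = bernoulli_loglik(y, eta_federated)
loglik_pooled = bernoulli_loglik(y, eta_pooled)
```

The federated and pooled predictors agree up to floating-point error, so the model is unchanged by the partitioning. What changes is who is able to compute which term.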

For specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods, the authors present an innovative model formulation. To address the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, they develop a factorized amortized variational approximation that achieves scalability independent of the number of observations.
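
The scalability claim can be illustrated with a small parameter-counting sketch. Everything here is assumed for illustration and is not taken from the paper: a Gaussian variational family for each auxiliary variable, a one-hidden-layer encoder per client (`init_encoder`, `encode`), and the layer sizes. The point is only that a non-amortized approximation needs parameters for every observation at every client, while an amortized one reuses a fixed set of per-client encoder weights.

```python
import numpy as np

rng = np.random.default_rng(1)


def init_encoder(n_features, n_hidden=8):
    """Hypothetical per-client inference network: covariates -> (mean, log_std) of q(z_ij)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.1, size=(n_hidden, 2)),
        "b2": np.zeros(2),
    }


def encode(params, X_j):
    """Map each local row x_ij to the variational mean and std of its auxiliary variable."""
    h = np.tanh(X_j @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    mean, log_std = out[:, 0], out[:, 1]
    return mean, np.exp(log_std)


def n_params(params):
    """Total number of scalar weights in one encoder."""
    return sum(p.size for p in params.values())


client_dims = [2, 3, 1]    # hypothetical covariates per client
encoders = [init_encoder(d) for d in client_dims]


def variational_param_counts(n_obs):
    # Non-amortized mean-field: one (mean, std) pair per observation per client.
    per_observation = 2 * n_obs * len(client_dims)
    # Amortized: only encoder weights, shared across all observations of a client.
    amortized = sum(n_params(e) for e in encoders)
    return per_observation, amortized


small = variational_param_counts(100)
large = variational_param_counts(100_000)

# One reparameterized draw of z for client 0 on a fresh batch of its local rows.
X_batch = rng.normal(size=(5, client_dims[0]))
mean, std = encode(encoders[0], X_batch)
z_sample = mean + std * rng.normal(size=mean.shape)
```

Comparing `small` and `large` shows the first count growing linearly with the number of observations while the second stays fixed. Each encoder sees only its own client's covariates, which is one way to read "factorized" in this context.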

The proposed framework is evaluated through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. The results demonstrate the efficacy of the authors' approach, paving the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios.

Critical Analysis

The paper presents a valuable contribution to the field of Vertical Federated Learning (VFL) by introducing a comprehensive Bayesian modeling framework. The data augmentation technique and the factorized amortized variational approximation are innovative approaches that address the challenges posed by the VFL setting.

However, the paper does not discuss the potential limitations of the proposed framework. For example, the performance of the Bayesian models may be sensitive to the choice of prior distributions, and the data augmentation technique may introduce additional computational overhead. Additionally, the paper could have explored the robustness of the framework to model misspecification or the impact of different client data distributions on the final model performance.

Further research could also investigate how this Bayesian VFL framework relates to, or could be combined with, other work in the field, such as TabVFL or VFLAIR. Applying the framework in related settings, such as Federated Bayesian Deep Learning or VFLGAN, could also broaden the scope and impact of the research.

Conclusion

This paper introduces a novel framework for fitting Bayesian models in the context of Vertical Federated Learning (VFL), where multiple clients hold distinct sets of covariates. The authors propose a data augmentation technique to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms, and they present a model formulation for specific VFL scenarios.

To address the dimensionality challenge posed by data augmentation, the researchers develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. The effectiveness of the framework is demonstrated through extensive experiments on various Bayesian models, including logistic regression, multilevel regression, and a hierarchical Bayesian split neural net model.

This work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in diverse domains. The framework's integration with other advancements in the field and its potential application to a broader range of Bayesian models present exciting opportunities for future exploration.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper for details.

Related Papers

Improving Privacy-Preserving Vertical Federated Learning by Efficient Communication with ADMM

Chulin Xie, Pin-Yu Chen, Qinbin Li, Arash Nourian, Ce Zhang, Bo Li

Federated learning (FL) enables distributed resource-constrained devices to jointly train shared models while keeping the training data local for privacy purposes. Vertical FL (VFL), which allows each client to collect partial features, has attracted intensive research efforts recently. We identified the main challenges that existing VFL frameworks are facing: the server needs to communicate gradients with the clients for each training step, incurring high communication cost that leads to rapid consumption of privacy budgets. To address these challenges, in this paper, we introduce a VFL framework with multiple heads (VIM), which takes the separate contribution of each client into account, and enables an efficient decomposition of the VFL optimization objective to sub-objectives that can be iteratively tackled by the server and the clients on their own. In particular, we propose an Alternating Direction Method of Multipliers (ADMM)-based method to solve our optimization problem, which allows clients to conduct multiple local updates before communication, and thus reduces the communication cost and leads to better performance under differential privacy (DP). We provide the user-level DP mechanism for our framework to protect user privacy. Moreover, we show that a byproduct of VIM is that the weights of learned heads reflect the importance of local clients. We conduct extensive evaluations and show that on four vertical FL datasets, VIM achieves significantly higher performance and faster convergence compared with the state-of-the-art. We also explicitly evaluate the importance of local clients and show that VIM enables functionalities such as client-level explanation and client denoising. We hope this work will shed light on a new way of effective VFL training and understanding.

4/9/2024

cs.LG cs.CR

Vertical Federated Learning for Effectiveness, Security, Applicability: A Survey

Mang Ye, Wei Shen, Bo Du, Eduard Snezhko, Vassili Kovalev, Pong C. Yuen

Vertical Federated Learning (VFL) is a privacy-preserving distributed learning paradigm where different parties collaboratively learn models using partitioned features of shared samples, without leaking private data. Recent research has shown promising results addressing various challenges in VFL, highlighting its potential for practical applications in cross-domain collaboration. However, the corresponding research is scattered and lacks organization. To advance VFL research, this survey offers a systematic overview of recent developments. First, we provide a history and background introduction, along with a summary of the general training protocol of VFL. We then revisit the taxonomy in recent reviews and analyze limitations in-depth. For a comprehensive and structured discussion, we synthesize recent research from three fundamental perspectives: effectiveness, security, and applicability. Finally, we discuss several critical future research directions in VFL, which will facilitate the developments in this field. We provide a collection of research lists and periodically update them at https://github.com/shentt67/VFL_Survey.

6/5/2024

cs.LG cs.CR

Entity Augmentation for Efficient Classification of Vertically Partitioned Data with Limited Overlap

Avi Amalanshu, Viswesh Nagaswamy, G. V. S. S. Prudhvi, Yash Sirvi, Debashish Chakravarty

Vertical Federated Learning (VFL) is a machine learning paradigm for learning from vertically partitioned data (i.e. features for each input are distributed across multiple guest clients and an aggregating host server owns labels) without communicating raw data. Traditionally, VFL involves an entity resolution phase where the host identifies and serializes the unique entities known to all guests. This is followed by private set intersection to find common entities, and an entity alignment step to ensure all guests are always processing the same entity's data. However, using only data of entities from the intersection means guests discard potentially useful data. Besides, the effect on privacy is dubious and these operations are computationally expensive. We propose a novel approach that eliminates the need for set intersection and entity alignment in categorical tasks. Our Entity Augmentation technique generates meaningful labels for activations sent to the host, regardless of their originating entity, enabling efficient VFL without explicit entity alignment. With limited overlap between training data, this approach performs substantially better (e.g. with 5% overlap, 48.1% vs 69.48% test accuracy on CIFAR-10). In fact, thanks to the regularizing effect, our model performs marginally better even with 100% overlap.

6/27/2024

cs.LG cs.CV cs.DC

UIFV: Data Reconstruction Attack in Vertical Federated Learning

Jirui Yang, Peng Chen, Zhihui Lu, Qiang Duan, Yubing Bao

Vertical Federated Learning (VFL) facilitates collaborative machine learning without the need for participants to share raw private data. However, recent studies have revealed privacy risks where adversaries might reconstruct sensitive features through data leakage during the learning process. Although data reconstruction methods based on gradient or model information are somewhat effective, they reveal limitations in VFL application scenarios. This is because these traditional methods heavily rely on specific model structures and/or have strict limitations on application scenarios. To address this, our study introduces the Unified InverNet Framework into VFL, which yields a novel and flexible approach (dubbed UIFV) that leverages intermediate feature data to reconstruct original data, instead of relying on gradients or model details. The intermediate feature data is the feature exchanged by different participants during the inference phase of VFL. Experiments on four datasets demonstrate that our methods significantly outperform state-of-the-art techniques in attack precision. Our work exposes severe privacy vulnerabilities within VFL systems that pose real threats to practical VFL applications and thus confirms the necessity of further enhancing privacy protection in the VFL architecture.

6/19/2024

cs.LG cs.AI cs.CR stat.ML