VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data Publication

2404.09722

Published 4/16/2024 by Xun Yuan, Yang Yang, Prosanta Gope, Aryan Pasikhani, Biplab Sikdar

VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data Publication

Abstract

In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, good data is not a free lunch and is always hard to access due to privacy regulations like the General Data Protection Regulation (GDPR). A potential solution is to release a synthetic dataset with a similar distribution to that of the private dataset. Nevertheless, in some scenarios, it has been found that the attributes needed to train an AI model belong to different parties, and they cannot share the raw data for synthetic data publication due to privacy regulations. In PETS 2023, Xue et al. proposed the first generative adversary network-based model, VertiGAN, for vertically partitioned data publication. However, after thoroughly investigating, we found that VertiGAN is less effective in preserving the correlation among the attributes of different parties. This article proposes a Vertical Federated Learning-based Generative Adversarial Network, VFLGAN, for vertically partitioned data publication to address the above issues. Our experimental results show that compared with VertiGAN, VFLGAN significantly improves the quality of synthetic data. Taking the MNIST dataset as an example, the quality of the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN w.r.t. the Fr'echet Distance. We also designed a more efficient and effective Gaussian mechanism for the proposed VFLGAN to provide the synthetic dataset with a differential privacy guarantee. On the other hand, differential privacy only gives the upper bound of the worst-case privacy guarantee. This article also proposes a practical auditing scheme that applies membership inference attacks to estimate privacy leakage through the synthetic dataset.

Create account to get full access

Overview

Introduces a new vertical federated learning-based generative adversarial network (VFLGAN) model for privacy-preserving data publication
Addresses the challenge of vertically partitioned data, where different data owners hold different feature sets for the same set of data samples
Enables the generation of synthetic data that preserves the statistical properties of the original data while protecting individual privacy

Plain English Explanation

VFLGAN is a new machine learning model that aims to solve a problem faced by organizations that want to share data, but are concerned about protecting the privacy of the individuals in that data.

Imagine a hospital that wants to share patient data with a research institute, but can't because the data includes sensitive information like medical history and financial records. The hospital holds some features (like medical history) while the research institute holds other features (like financial records) - this is called "vertically partitioned data."

VFLGAN allows the hospital and research institute to work together to generate synthetic data that has the same statistical properties as the original data, but without any individual-level information. This synthetic data can then be shared publicly, enabling valuable research while protecting patient privacy.

The key innovation of VFLGAN is that it uses a technique called "federated learning" to train the model without ever sharing the raw data. Instead, the hospital and research institute each train a part of the model on their own data, and then combine the results to generate the synthetic data.

This approach ensures that no individual's private information is ever revealed, while still allowing the data to be used for important research and analysis. By using generative adversarial networks (GANs) and differential privacy, VFLGAN is able to generate high-quality synthetic data that closely matches the original data distribution.

Technical Explanation

The VFLGAN model consists of several key components:

Vertical Federated Learning: VFLGAN uses a vertical federated learning approach, where each data owner (e.g., the hospital and research institute) trains a portion of the overall model on their own data, and the model components are then combined to generate the final synthetic data.
Generative Adversarial Network (GAN): VFLGAN leverages a GAN architecture, which includes a generator model that learns to produce synthetic data, and a discriminator model that tries to distinguish the synthetic data from the real data. The two models are trained in an adversarial manner to improve the quality of the synthetic data.
Differential Privacy: To ensure privacy protection, VFLGAN incorporates differential privacy mechanisms into the training process. This adds controlled noise to the model updates, making it difficult to infer any individual's private information from the synthetic data.

The experimental evaluation demonstrates that VFLGAN can generate high-quality synthetic data that preserves the statistical properties of the original vertically partitioned data, while providing strong privacy guarantees. The synthetic data can be used for a variety of downstream tasks, such as federated multimodal learning, without revealing any private information.

Critical Analysis

The VFLGAN paper makes a valuable contribution to the field of privacy-preserving data publication, but it also has some limitations and potential areas for further research:

Scalability: The paper focuses on a relatively small-scale example, and it's unclear how the VFLGAN model would scale to larger, more complex datasets with more data owners and feature sets.
Heterogeneous Data: The current VFLGAN model assumes that the data is vertically partitioned, but it doesn't address the challenge of dealing with heterogeneous data formats and distributions across different data owners.
Robustness: The paper doesn't explore the robustness of the VFLGAN model to various types of attacks or adversarial manipulations of the data. It's important to ensure that the synthetic data generated by VFLGAN is resilient to such threats.
Real-world Deployment: While the paper demonstrates the technical feasibility of VFLGAN, there are likely additional challenges and considerations that would need to be addressed for real-world deployment, such as regulatory compliance, stakeholder buy-in, and integration with existing data management systems.

Overall, the VFLGAN paper presents a promising approach to privacy-preserving data publication, but further research and development will be necessary to address these limitations and ensure the model's viability in practical applications.

Conclusion

The VFLGAN model introduced in this paper represents a significant step forward in addressing the challenge of privacy-preserving data publication for vertically partitioned data. By leveraging vertical federated learning, generative adversarial networks, and differential privacy, VFLGAN enables the generation of high-quality synthetic data that preserves the statistical properties of the original data while protecting individual privacy.

This work has important implications for a wide range of applications, from healthcare and finance to smart city planning and beyond, where organizations need to share data for research and analysis but are constrained by privacy concerns. As the field of privacy-preserving data publication continues to evolve, the VFLGAN model and its underlying techniques will likely play a crucial role in unlocking the value of data while safeguarding individual privacy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Vertical Federated Learning for Effectiveness, Security, Applicability: A Survey

Mang Ye, Wei Shen, Bo Du, Eduard Snezhko, Vassili Kovalev, Pong C. Yuen

Vertical Federated Learning (VFL) is a privacy-preserving distributed learning paradigm where different parties collaboratively learn models using partitioned features of shared samples, without leaking private data. Recent research has shown promising results addressing various challenges in VFL, highlighting its potential for practical applications in cross-domain collaboration. However, the corresponding research is scattered and lacks organization. To advance VFL research, this survey offers a systematic overview of recent developments. First, we provide a history and background introduction, along with a summary of the general training protocol of VFL. We then revisit the taxonomy in recent reviews and analyze limitations in-depth. For a comprehensive and structured discussion, we synthesize recent research from three fundamental perspectives: effectiveness, security, and applicability. Finally, we discuss several critical future research directions in VFL, which will facilitate the developments in this field. We provide a collection of research lists and periodically update them at https://github.com/shentt67/VFL_Survey.

6/5/2024

cs.LG cs.CR

Entity Augmentation for Efficient Classification of Vertically Partitioned Data with Limited Overlap

Avi Amalanshu, Viswesh Nagaswamy, G. V. S. S. Prudhvi, Yash Sirvi, Debashish Chakravarty

Vertical Federated Learning (VFL) is a machine learning paradigm for learning from vertically partitioned data (i.e. features for each input are distributed across multiple guest clients and an aggregating host server owns labels) without communicating raw data. Traditionally, VFL involves an entity resolution phase where the host identifies and serializes the unique entities known to all guests. This is followed by private set intersection to find common entities, and an entity alignment step to ensure all guests are always processing the same entity's data. However, using only data of entities from the intersection means guests discard potentially useful data. Besides, the effect on privacy is dubious and these operations are computationally expensive. We propose a novel approach that eliminates the need for set intersection and entity alignment in categorical tasks. Our Entity Augmentation technique generates meaningful labels for activations sent to the host, regardless of their originating entity, enabling efficient VFL without explicit entity alignment. With limited overlap between training data, this approach performs substantially better (e.g. with 5% overlap, 48.1% vs 69.48% test accuracy on CIFAR-10). In fact, thanks to the regularizing effect, our model performs marginally better even with 100% overlap.

6/27/2024

cs.LG cs.CV cs.DC

📊

Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference

Conor Hassan, Matthew Sutton, Antonietta Mira, Kerrie Mengersen

Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains.

5/8/2024

cs.LG stat.ML

UIFV: Data Reconstruction Attack in Vertical Federated Learning

Jirui Yang, Peng Chen, Zhihui Lu, Qiang Duan, Yubing Bao

Vertical Federated Learning (VFL) facilitates collaborative machine learning without the need for participants to share raw private data. However, recent studies have revealed privacy risks where adversaries might reconstruct sensitive features through data leakage during the learning process. Although data reconstruction methods based on gradient or model information are somewhat effective, they reveal limitations in VFL application scenarios. This is because these traditional methods heavily rely on specific model structures and/or have strict limitations on application scenarios. To address this, our study introduces the Unified InverNet Framework into VFL, which yields a novel and flexible approach (dubbed UIFV) that leverages intermediate feature data to reconstruct original data, instead of relying on gradients or model details. The intermediate feature data is the feature exchanged by different participants during the inference phase of VFL. Experiments on four datasets demonstrate that our methods significantly outperform state-of-the-art techniques in attack precision. Our work exposes severe privacy vulnerabilities within VFL systems that pose real threats to practical VFL applications and thus confirms the necessity of further enhancing privacy protection in the VFL architecture.

6/19/2024

cs.LG cs.AI cs.CR stat.ML