FedCVT: Semi-supervised Vertical Federated Learning with Cross-view Training

Read original: arXiv:2008.10838 - Published 6/18/2024 by Yan Kang, Yang Liu, Xinle Liang

🏋️

Overview

Federated learning allows multiple parties to build machine learning models collaboratively without exposing their data.
Vertical federated learning (VFL) enables parties to build a joint model based on distributed features of aligned samples.
VFL requires a sufficient amount of aligned samples, but in reality, the set of aligned samples may be small.
The paper proposes Federated Cross-view Training (FedCVT), a semi-supervised learning approach that improves the performance of the VFL model with limited aligned samples.

Plain English Explanation

Federated learning is a way for multiple organizations to work together on a machine learning model without sharing their private data. In vertical federated learning, the organizations each have different features (or attributes) about the same set of samples, and they can build a joint model using this distributed data. However, VFL requires that the organizations have a sufficient number of samples where the features are aligned - meaning they have the same set of features for those samples.

In reality, the number of aligned samples may be small, leaving a lot of non-aligned data unused. To address this, the researchers propose Federated Cross-view Training (FedCVT). FedCVT is a semi-supervised approach that can improve the VFL model's performance even when there are limited aligned samples.

FedCVT does this by:

Estimating representations for missing features: It can fill in the gaps where one organization is missing certain features for a sample.
Predicting pseudo-labels for unlabeled samples: It can generate labels for samples that don't have true labels, expanding the training data.
Training three classifiers jointly: It trains three different models simultaneously, each focusing on a different "view" of the data, to improve the overall performance.

Importantly, FedCVT does all of this without the organizations having to share their original data or model parameters, preserving data privacy. The researchers show that FedCVT significantly outperforms the basic VFL approach in experiments on several datasets.

Technical Explanation

Vertical federated learning (VFL) enables multiple parties to collaboratively build a joint machine learning model based on distributed features of aligned samples. However, VFL requires all parties to share a sufficient amount of aligned samples. In reality, the set of aligned samples may be small, leaving the majority of the non-aligned data unused.

To address this, the authors propose Federated Cross-view Training (FedCVT), a semi-supervised learning approach that improves the performance of the VFL model with limited aligned samples. FedCVT has three key components:

Feature Imputation: FedCVT estimates representations for missing features using a deep neural network, allowing it to leverage non-aligned samples.
Pseudo-labeling: FedCVT predicts pseudo-labels for unlabeled samples to expand the training set.
Cross-view Training: FedCVT trains three classifiers jointly based on different views of the expanded training set (original features, imputed features, and pseudo-labels) to improve the VFL model's performance.

Importantly, FedCVT does not require parties to share their original data or model parameters, preserving data privacy. The authors evaluate FedCVT on NUS-WIDE, Vehicle, and CIFAR10 datasets, demonstrating that it significantly outperforms the vanilla VFL approach that only utilizes aligned samples.

Critical Analysis

The paper presents a novel and promising approach to improving the performance of vertical federated learning models when the set of aligned samples is limited. By incorporating feature imputation, pseudo-labeling, and cross-view training, FedCVT is able to leverage non-aligned samples and expand the effective training set without compromising data privacy.

However, the authors do acknowledge some limitations of their approach. For example, the performance of FedCVT may degrade if the underlying feature distributions across parties are too different, making accurate feature imputation difficult. Additionally, the pseudo-labeling process could introduce noise, potentially harming model performance if not properly controlled.

Further research could explore ways to adaptively adjust the balance between using imputed features and pseudo-labels, based on the characteristics of the dataset and the parties involved. It would also be valuable to investigate the robustness of FedCVT to various data distribution shifts or adversarial attacks, given the importance of preserving data confidentiality in federated learning settings.

Conclusion

The Federated Cross-view Training (FedCVT) approach proposed in this paper represents a significant advancement in vertical federated learning. By leveraging semi-supervised techniques to overcome the challenge of limited aligned samples, FedCVT can improve the performance of VFL models while preserving data privacy. The experimental results demonstrate the effectiveness of this approach, and the insights gained from this research could inform the development of more robust and versatile federated learning solutions in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🏋️

FedCVT: Semi-supervised Vertical Federated Learning with Cross-view Training

Yan Kang, Yang Liu, Xinle Liang

Federated learning allows multiple parties to build machine learning models collaboratively without exposing data. In particular, vertical federated learning (VFL) enables participating parties to build a joint machine learning model based on distributed features of aligned samples. However, VFL requires all parties to share a sufficient amount of aligned samples. In reality, the set of aligned samples may be small, leaving the majority of the non-aligned data unused. In this article, we propose Federated Cross-view Training (FedCVT), a semi-supervised learning approach that improves the performance of the VFL model with limited aligned samples. More specifically, FedCVT estimates representations for missing features, predicts pseudo-labels for unlabeled samples to expand the training set, and trains three classifiers jointly based on different views of the expanded training set to improve the VFL model's performance. FedCVT does not require parties to share their original data and model parameters, thus preserving data privacy. We conduct experiments on NUS-WIDE, Vehicle, and CIFAR10 datasets. The experimental results demonstrate that FedCVT significantly outperforms vanilla VFL that only utilizes aligned samples. Finally, we perform ablation studies to investigate the contribution of each component of FedCVT to the performance of FedCVT. Code is available at https://github.com/yankang18/FedCVT

6/18/2024

Vertical Federated Learning Hybrid Local Pre-training

Wenguo Li, Xinling Guo, Xu Jiao, Tiancheng Huang, Xiaoran Yan, Yao Yang

Vertical Federated Learning (VFL), which has a broad range of real-world applications, has received much attention in both academia and industry. Enterprises aspire to exploit more valuable features of the same users from diverse departments to boost their model prediction skills. VFL addresses this demand and concurrently secures individual parties from exposing their raw data. However, conventional VFL encounters a bottleneck as it only leverages aligned samples, whose size shrinks with more parties involved, resulting in data scarcity and the waste of unaligned data. To address this problem, we propose a novel VFL Hybrid Local Pre-training (VFLHLP) approach. VFLHLP first pre-trains local networks on the local data of participating parties. Then it utilizes these pre-trained networks to adjust the sub-model for the labeled party or enhance representation learning for other parties during downstream federated learning on aligned data, boosting the performance of federated models. The experimental results on real-world advertising datasets, demonstrate that our approach achieves the best performance over baseline methods by large margins. The ablation study further illustrates the contribution of each technique in VFLHLP to its overall performance.

5/22/2024

Vertical Federated Learning for Effectiveness, Security, Applicability: A Survey

Mang Ye, Wei Shen, Bo Du, Eduard Snezhko, Vassili Kovalev, Pong C. Yuen

Vertical Federated Learning (VFL) is a privacy-preserving distributed learning paradigm where different parties collaboratively learn models using partitioned features of shared samples, without leaking private data. Recent research has shown promising results addressing various challenges in VFL, highlighting its potential for practical applications in cross-domain collaboration. However, the corresponding research is scattered and lacks organization. To advance VFL research, this survey offers a systematic overview of recent developments. First, we provide a history and background introduction, along with a summary of the general training protocol of VFL. We then revisit the taxonomy in recent reviews and analyze limitations in-depth. For a comprehensive and structured discussion, we synthesize recent research from three fundamental perspectives: effectiveness, security, and applicability. Finally, we discuss several critical future research directions in VFL, which will facilitate the developments in this field. We provide a collection of research lists and periodically update them at https://github.com/shentt67/VFL_Survey.

6/5/2024

📊

Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference

Conor Hassan, Matthew Sutton, Antonietta Mira, Kerrie Mengersen

Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains.

5/8/2024