VertiBayes: Learning Bayesian network parameters from vertically partitioned data with missing values

Read original: arXiv:2210.17228 - Published 5/22/2024 by Florian van Daalen, Lianne Ippel, Andre Dekker, Inigo Bermejo

Overview

Federated learning allows training a machine learning model on decentralized data
Bayesian networks are probabilistic models used in AI, known for their interpretability and usefulness for decision support
Some research exists on federated learning of Bayesian networks, but work on vertically partitioned (heterogeneous) data is limited and often omits the handling of missing values

Plain English Explanation

Federated learning is a way to train machine learning models without centralizing all the data. Instead, the data stays distributed across different locations, and the model is trained by sharing updates between these locations. This is useful when the data can't be easily combined, like medical records across different hospitals.

Bayesian networks are a type of model that can represent relationships between different factors probabilistically. They're often used in AI applications because they can combine expert knowledge with data, and their inner workings are easy to understand. This makes them helpful for decision-making, like in healthcare.

Some research has looked at training Bayesian networks in a federated setting. However, there are still challenges when the data is split vertically (different variables in different datasets) or has missing values. This paper proposes a new method called VertiBayes to address these issues.
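To make "split vertically" concrete, here is a minimal sketch in Python. The two parties, the patient IDs, and the variable names are all hypothetical and chosen only for illustration; none of them come from the paper.

```python
# Hypothetical toy records, for illustration only (not data from the paper).
# Both parties know the same three patients but hold different variables.
hospital = {
    "p1": {"smoker": "yes", "age_group": "60+"},
    "p2": {"smoker": "no", "age_group": "40-59"},
    "p3": {"smoker": None, "age_group": "60+"},  # None marks a missing value
}
registry = {
    "p1": {"diagnosis": "positive"},
    "p2": {"diagnosis": "negative"},
    "p3": {"diagnosis": "positive"},
}


def variables_held(party):
    """Return the set of variable names stored by one party."""
    return {name for record in party.values() for name in record}


# Vertical partitioning: same individuals, disjoint sets of variables.
same_individuals = hospital.keys() == registry.keys()
disjoint_variables = variables_held(hospital).isdisjoint(variables_held(registry))

# A joint statistic needs values from both parties. Computing it this way
# requires pooling the records, which a federated method aims to avoid.
joint_count = sum(
    1
    for pid in hospital
    if hospital[pid]["smoker"] == "yes"
    and registry[pid]["diagnosis"] == "positive"
)
print(same_individuals, disjoint_variables, joint_count)
```

Learning a Bayesian network needs many joint counts of this kind, and the difficulty is obtaining them without either party revealing its column to the other.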

Technical Explanation

The paper introduces VertiBayes, a method for training Bayesian networks (both structure and parameters) on vertically partitioned data that may contain missing values. The method supports an arbitrary number of parties.

For structure learning, the authors adapted the K2 algorithm and used a privacy-preserving scalar product protocol. This allows learning the network structure without centralizing the data.
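As a rough illustration of why a scalar product is a natural building block here: a joint count over two parties' variables equals the dot product of two binary indicator vectors, and K2 scores a candidate parent set from exactly such counts. The sketch below assumes two parties and toy columns, and it uses a plain dot product as a stand-in for the privacy-preserving protocol, so it offers no privacy at all. The function names are hypothetical and this is not the authors' implementation.

```python
from math import lgamma


def indicator(column, value):
    """Binary vector with 1 where a party's local column equals `value`."""
    return [1 if v == value else 0 for v in column]


def scalar_product(u, v):
    """Plain dot product; a stand-in for the privacy-preserving protocol."""
    return sum(a * b for a, b in zip(u, v))


def log_k2_term(counts):
    """Log of the K2 factor for one parent configuration.

    counts[k] is the number of records in which the child takes its k-th
    value under this parent configuration. The factor is
    (r - 1)! / (N + r - 1)! * prod_k counts[k]!, with r = len(counts)
    and N = sum(counts).
    """
    r = len(counts)
    n = sum(counts)
    return lgamma(r) - lgamma(n + r) + sum(lgamma(c + 1) for c in counts)


# Toy columns, row-aligned by individual. Party A holds `smoker`,
# party B holds `diagnosis` (hypothetical variables).
smoker = ["yes", "no", "yes", "yes", "no"]
diagnosis = ["pos", "neg", "pos", "neg", "neg"]


def joint_count(s, d):
    """Number of individuals with smoker == s and diagnosis == d."""
    return scalar_product(indicator(smoker, s), indicator(diagnosis, d))


child_values = ["pos", "neg"]

# K2 log-score of `diagnosis` with `smoker` as its parent ...
score_with_parent = sum(
    log_k2_term([joint_count(s, d) for d in child_values])
    for s in ["yes", "no"]
)
# ... and with no parents, which only needs party B's local counts.
score_without_parent = log_k2_term([diagnosis.count(d) for d in child_values])

print(score_with_parent, score_without_parent)
```

K2 greedily adds the parent that most improves this score, so in a sketch like this every candidate evaluation reduces to a batch of cross-party counts.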

For parameter learning, VertiBayes uses a two-step approach. First, it learns an intermediate model using maximum likelihood, treating missing values as a special value. Then, it trains the final model on synthetic data generated by the intermediate model, using the Expectation-Maximization (EM) algorithm.
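The two steps can be sketched on a toy network A -> B in which only B has missing values. This is a simplified, centralized illustration under stated assumptions: the data, the `"?"` marker, and all function names are hypothetical, and the real method obtains the step-one counts through the privacy-preserving protocol rather than from pooled rows.

```python
import random
from collections import Counter

MISSING = "?"  # hypothetical marker for a missing value
B_VALUES = ("0", "1")


def fit_intermediate(rows):
    """Step 1: maximum likelihood for A -> B, counting MISSING as a value of B."""
    a_counts = Counter(a for a, _ in rows)
    ab_counts = Counter(rows)
    p_a = {a: n / len(rows) for a, n in a_counts.items()}
    p_b_given_a = {
        a: {b: ab_counts[(a, b)] / a_counts[a] for b in B_VALUES + (MISSING,)}
        for a in a_counts
    }
    return p_a, p_b_given_a


def sample_synthetic(p_a, p_b_given_a, n, rng):
    """Draw n synthetic (a, b) rows from the intermediate model."""
    rows = []
    for _ in range(n):
        a = rng.choices(list(p_a), weights=list(p_a.values()))[0]
        dist = p_b_given_a[a]
        b = rng.choices(list(dist), weights=list(dist.values()))[0]
        rows.append((a, b))
    return rows


def em_b_given_a(rows, iterations=50):
    """Step 2: EM estimate of P(B | A), treating MISSING as an unobserved B."""
    counts = Counter(rows)
    a_values = sorted({a for a, _ in rows})
    theta = {a: {b: 1 / len(B_VALUES) for b in B_VALUES} for a in a_values}
    for _ in range(iterations):
        new_theta = {}
        for a in a_values:
            # E-step: spread each missing record over B's values
            # in proportion to the current estimate.
            expected = {
                b: counts[(a, b)] + counts[(a, MISSING)] * theta[a][b]
                for b in B_VALUES
            }
            # M-step: renormalize the expected counts.
            total = sum(expected.values())
            new_theta[a] = {b: expected[b] / total for b in B_VALUES}
        theta = new_theta
    return theta


# Toy "private" data: 20% of B is missing for each value of A.
private_rows = (
    [("x", "1")] * 30 + [("x", "0")] * 10 + [("x", MISSING)] * 10
    + [("y", "1")] * 5 + [("y", "0")] * 35 + [("y", MISSING)] * 10
)

p_a, p_b_given_a = fit_intermediate(private_rows)
synthetic_rows = sample_synthetic(p_a, p_b_given_a, 5000, random.Random(0))
final_model = em_b_given_a(synthetic_rows)
print(final_model)
```

Because the final model is trained only on synthetic rows, EM can run in one place without access to the original records. In this toy case EM simply converges to the frequencies among the observed rows; in a larger network with missing values in several variables, the E-step would require inference over the network instead.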

The privacy guarantees of VertiBayes are equivalent to those of the privacy-preserving scalar product protocol it uses. The authors show experimentally that VertiBayes produces models comparable to those learned using traditional algorithms, and they estimate how the cost of the method grows with the number of samples, network size, and complexity.

The paper also proposes two alternative approaches to estimate the performance of the model using vertically partitioned data, and shows that these lead to reasonably accurate estimates.

Critical Analysis

The paper addresses an important challenge in federated learning: training Bayesian networks on vertically partitioned data with missing values. This is a common scenario in real-world applications, such as healthcare, where data is spread across different institutions.

The proposed VertiBayes method seems promising, as it maintains model performance while addressing the key limitations of previous approaches. The authors provide a thorough experimental evaluation and analysis of the method's scalability and complexity.

One potential limitation is the reliance on the privacy-preserving scalar product protocol. The protocol may add computational overhead, and because VertiBayes inherits its privacy guarantees, the choice of protocol needs careful consideration. Additionally, the paper does not appear to discuss how the amount or distribution of missing values affects the method's performance.

Further research could explore extensions of VertiBayes, such as handling non-Gaussian distributions or improving the estimation of model performance in the vertically partitioned setting.

Conclusion

This paper presents VertiBayes, a method for training Bayesian networks on vertically partitioned data with missing values in a federated learning setting. In the authors' experiments, it produces models comparable to those learned using traditional algorithms, with privacy guarantees inherited from the underlying scalar product protocol. The proposed techniques could broaden the applicability of Bayesian networks in domains like healthcare, where data is often decentralized and may contain missing values.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
