Vertical Federated Learning for Effectiveness, Security, Applicability: A Survey

2405.17495

Published 6/5/2024 by Mang Ye, Wei Shen, Bo Du, Eduard Snezhko, Vassili Kovalev, Pong C. Yuen

Abstract

Vertical Federated Learning (VFL) is a privacy-preserving distributed learning paradigm where different parties collaboratively learn models using partitioned features of shared samples, without leaking private data. Recent research has shown promising results addressing various challenges in VFL, highlighting its potential for practical applications in cross-domain collaboration. However, the corresponding research is scattered and lacks organization. To advance VFL research, this survey offers a systematic overview of recent developments. First, we provide a history and background introduction, along with a summary of the general training protocol of VFL. We then revisit the taxonomy in recent reviews and analyze limitations in-depth. For a comprehensive and structured discussion, we synthesize recent research from three fundamental perspectives: effectiveness, security, and applicability. Finally, we discuss several critical future research directions in VFL, which will facilitate the developments in this field. We provide a collection of research lists and periodically update them at https://github.com/shentt67/VFL_Survey.

Overview

This paper provides a comprehensive survey of vertical federated learning (VFL), a machine learning approach that enables multiple parties to train a shared model without directly sharing their private data.
The authors discuss the effectiveness, security, and applicability of VFL, highlighting its advantages over traditional federated learning and its potential to address the challenges of data privacy and model performance.
The survey covers the key concepts, architectures, and algorithms used in VFL, as well as recent advancements and practical considerations for implementing VFL in real-world scenarios.

Plain English Explanation

Vertical federated learning (VFL) is a way for different organizations or individuals to work together to train a machine learning model without having to share their private data. In traditional (horizontal) federated learning, each party holds the same kinds of features for different samples. In VFL, the parties hold different features for the same samples, and these features can be combined to train a more powerful model.

For example, imagine a bank and an online retailer wanted to build a model to predict customer churn. The bank might have information about the customer's financial history, while the retailer might have data on the customer's purchasing behavior. By using VFL, the two organizations could train a model that takes advantage of both datasets without either party having to share their sensitive customer information.
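The bank and retailer scenario can be sketched in a few lines of Python. This is only an illustrative toy, not the protocol from the survey: the `Party` class, the `train` function, the synthetic data, and the choice of a split logistic regression are all assumptions made for this example. Each party keeps its feature rows locally and sends out only one partial score per customer. For simplicity the per-sample residuals are exchanged in plaintext, which a real deployment would likely need to protect.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Party:
    """Hypothetical participant: a private feature slice plus a local linear model."""

    def __init__(self, features):
        self.features = features                  # raw rows never leave this object
        self.weights = [0.0] * len(features[0])

    def partial_logits(self):
        # Only one scalar per sample is shared, not the features themselves.
        return [sum(w * x for w, x in zip(self.weights, row))
                for row in self.features]

    def apply_gradient(self, residuals, lr):
        # Local gradient of the logistic loss w.r.t. this party's weights.
        n = len(self.features)
        for j in range(len(self.weights)):
            grad = sum(r * row[j] for r, row in zip(residuals, self.features)) / n
            self.weights[j] -= lr * grad


def train(parties, labels, epochs=200, lr=0.5):
    """The label holder sums partial logits and returns residuals to each party."""
    losses = []
    for _ in range(epochs):
        partials = [p.partial_logits() for p in parties]
        probs = [sigmoid(sum(vals)) for vals in zip(*partials)]
        loss = -sum(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
                    for y, p in zip(labels, probs)) / len(labels)
        losses.append(loss)
        residuals = [p - y for p, y in zip(probs, labels)]
        for party in parties:
            party.apply_gradient(residuals, lr)
    return losses


# Synthetic, already aligned data: row i describes the same customer at both parties.
random.seed(0)
n = 200
bank_x = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
shop_x = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
# Churn depends on one bank feature and one retailer feature.
churned = [1 if b[0] + s[1] > 0 else 0 for b, s in zip(bank_x, shop_x)]

bank, retailer = Party(bank_x), Party(shop_x)
losses = train([bank, retailer], churned)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the label depends on features held by both parties, neither could fit it well alone. The joint model can, while each side only ever sees its own columns.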

The key benefit of VFL is that it allows for more effective and secure machine learning by leveraging diverse data sources while protecting individual privacy. Two examples of how researchers are advancing the field are VFLAIR, a research library and benchmark for VFL, and a survey on contribution evaluation in VFL.

Other VFL techniques covered in the survey include hybrid local pre-training, scalable VFL via data augmentation, and improved privacy preservation in VFL. These approaches aim to make VFL more effective, secure, and widely applicable in real-world scenarios.

Technical Explanation

The paper begins with an overview of vertical federated learning (VFL), in which multiple parties that hold different features of the same samples collaborate on training a shared machine learning model without directly sharing their private data. This contrasts with traditional (horizontal) federated learning, where parties hold the same features for different samples.
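The difference between the two partitioning schemes can be made concrete with a toy table. The sketch below is purely illustrative: the customer names, column names, and the two helper functions are hypothetical and do not come from the survey.

```python
# Toy table: one row per customer, one column per feature.
table = {
    "alice": {"income": 52, "balance": 9, "orders": 14, "returns": 1},
    "bob":   {"income": 38, "balance": 2, "orders": 3,  "returns": 0},
    "carol": {"income": 71, "balance": 15, "orders": 8, "returns": 2},
    "dave":  {"income": 45, "balance": 4, "orders": 21, "returns": 5},
}


def horizontal_split(rows, ids_a):
    """Horizontal setting: each party holds all columns for a subset of rows."""
    part_a = {k: v for k, v in rows.items() if k in ids_a}
    part_b = {k: v for k, v in rows.items() if k not in ids_a}
    return part_a, part_b


def vertical_split(rows, cols_a):
    """Vertical setting: each party holds a subset of columns for all rows."""
    part_a = {k: {c: x for c, x in v.items() if c in cols_a} for k, v in rows.items()}
    part_b = {k: {c: x for c, x in v.items() if c not in cols_a} for k, v in rows.items()}
    return part_a, part_b


h_a, h_b = horizontal_split(table, {"alice", "bob"})
v_a, v_b = vertical_split(table, {"income", "balance"})

print(sorted(h_a), sorted(h_b))          # different customers, same columns
print(sorted(v_a["alice"]), sorted(v_b["alice"]))  # same customer, different columns
```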

The authors then cover the key concepts and architectures used in VFL, including feature-level and instance-level VFL. They also discuss various VFL algorithms, such as those for model aggregation, data alignment, and privacy preservation. The paper covers recent advancements in the field, including VFLAIR, a research library and benchmark that provides standardized modules for evaluating VFL attacks and defenses, and a survey on contribution evaluation in VFL, which reviews techniques for assessing each party's contribution to the learning process.
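Data alignment means finding which samples the parties have in common before training starts. The following is a minimal sketch under strong assumptions: the `blind` and `align` helpers, the shared `salt`, and the user IDs are hypothetical, and a salted hash is used only as a stand-in. It is not a secure private set intersection protocol, since hashed IDs from a small ID space can be guessed by brute force.

```python
import hashlib


def blind(ids, salt):
    """Map each salted SHA-256 digest back to the local ID it came from."""
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}


def align(local_ids, remote_digests, salt):
    """Return local IDs whose digests also appear at the other party.

    Sorting by digest gives both parties the same row order without
    either side sending raw IDs.
    """
    local = blind(local_ids, salt)
    shared = sorted(d for d in local if d in remote_digests)
    return [local[d] for d in shared]


salt = "agreed-out-of-band"                      # assumed to be shared beforehand
bank_ids = ["u1", "u2", "u3", "u5"]
retailer_ids = ["u2", "u3", "u4", "u5"]

# Each party sends only its digests to the other side.
bank_digests = set(blind(bank_ids, salt))
retailer_digests = set(blind(retailer_ids, salt))

bank_aligned = align(bank_ids, retailer_digests, salt)
retailer_aligned = align(retailer_ids, bank_digests, salt)
print(bank_aligned == retailer_aligned, sorted(bank_aligned))
```

Only the aligned samples can be used for joint training, which is why the number of usable rows can shrink as more parties join.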

Additionally, the survey examines specific VFL techniques: hybrid local pre-training, which pre-trains local networks on each party's own data before federated training on aligned data; scalable VFL via data augmentation, which addresses the scalability challenges of VFL; and improved privacy preservation, which focuses on strengthening the privacy guarantees of VFL.

Critical Analysis

The paper provides a comprehensive and well-structured overview of the field of vertical federated learning, highlighting its key advantages and the latest advancements in the area. The authors have done a commendable job in covering the technical details and research directions while maintaining a clear and accessible writing style.

One potential area for further exploration is the practical implementation and deployment challenges of VFL, as the survey primarily focuses on the theoretical and algorithmic aspects. The authors do mention some practical considerations, such as data alignment and scalability, but a deeper discussion on the practical challenges and best practices for real-world VFL deployment could further enhance the value of the survey.

Additionally, the authors could have delved deeper into the potential limitations and drawbacks of VFL, such as the impact of data heterogeneity, the complexity of coordinating multiple parties, and the potential for model bias or fairness issues. A more in-depth critical analysis of these aspects would help readers to better understand the trade-offs and considerations when adopting VFL.

Conclusion

This comprehensive survey on vertical federated learning provides a valuable resource for researchers, practitioners, and policymakers interested in understanding the latest advancements and potential of this emerging field. By highlighting the effectiveness, security, and applicability of VFL, the authors have made a strong case for the importance of this approach in addressing the challenges of data privacy and model performance in the age of distributed and diverse data sources.

The survey's coverage of key concepts, architectures, algorithms, and practical techniques, as well as the critical analysis and future research directions, make it a must-read for anyone seeking to stay informed on the cutting edge of machine learning and data privacy. As the adoption of VFL continues to grow, this survey will undoubtedly serve as a valuable reference for researchers and developers working to push the boundaries of secure and effective collaborative learning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VFLAIR: A Research Library and Benchmark for Vertical Federated Learning

Tianyuan Zou, Zixuan Gu, Yu He, Hideaki Takahashi, Yang Liu, Ya-Qin Zhang

Vertical Federated Learning (VFL) has emerged as a collaborative training paradigm that allows participants with different features of the same group of users to accomplish cooperative training without exposing their raw data or model parameters. VFL has gained significant attention for its research potential and real-world applications in recent years, but still faces substantial challenges, such as in defending against various kinds of data inference and backdoor attacks. Moreover, most existing VFL projects are industry-facing and not easily used for keeping track of the current research progress. To address this need, we present an extensible and lightweight VFL framework VFLAIR (available at https://github.com/FLAIR-THU/VFLAIR), which supports VFL training with a variety of models, datasets and protocols, along with standardized modules for comprehensive evaluations of attacks and defense strategies. We also benchmark the performance of 11 attacks and 8 defenses under different communication and model partition settings and draw concrete insights and recommendations on the choice of defense strategies for different practical VFL deployment scenarios.

4/17/2024

cs.LG

A Survey on Contribution Evaluation in Vertical Federated Learning

Yue Cui, Chung-ju Huang, Yuzhu Zhang, Leye Wang, Lixin Fan, Xiaofang Zhou, Qiang Yang

Vertical Federated Learning (VFL) has emerged as a critical approach in machine learning to address privacy concerns associated with centralized data storage and processing. VFL facilitates collaboration among multiple entities with distinct feature sets on the same user population, enabling the joint training of predictive models without direct data sharing. A key aspect of VFL is the fair and accurate evaluation of each entity's contribution to the learning process. This is crucial for maintaining trust among participating entities, ensuring equitable resource sharing, and fostering a sustainable collaboration framework. This paper provides a thorough review of contribution evaluation in VFL. We categorize the vast array of contribution evaluation techniques along the VFL lifecycle, granularity of evaluation, privacy considerations, and core computational methods. We also explore various tasks in VFL that involve contribution evaluation and analyze their required evaluation properties and relation to the VFL lifecycle phases. Finally, we present a vision for the future challenges of contribution evaluation in VFL. By providing a structured analysis of the current landscape and potential advancements, this paper aims to guide researchers and practitioners in the design and implementation of more effective, efficient, and privacy-centric VFL solutions. Relevant literature and open-source resources have been compiled and are being continuously updated at the GitHub repository: https://github.com/cuiyuebing/VFL_CE.

5/7/2024

cs.LG cs.DC

Vertical Federated Learning Hybrid Local Pre-training

Wenguo Li, Xinling Guo, Xu Jiao, Tiancheng Huang, Xiaoran Yan, Yao Yang

Vertical Federated Learning (VFL), which has a broad range of real-world applications, has received much attention in both academia and industry. Enterprises aspire to exploit more valuable features of the same users from diverse departments to boost their model prediction skills. VFL addresses this demand and concurrently secures individual parties from exposing their raw data. However, conventional VFL encounters a bottleneck as it only leverages aligned samples, whose size shrinks with more parties involved, resulting in data scarcity and the waste of unaligned data. To address this problem, we propose a novel VFL Hybrid Local Pre-training (VFLHLP) approach. VFLHLP first pre-trains local networks on the local data of participating parties. Then it utilizes these pre-trained networks to adjust the sub-model for the labeled party or enhance representation learning for other parties during downstream federated learning on aligned data, boosting the performance of federated models. The experimental results on real-world advertising datasets demonstrate that our approach achieves the best performance over baseline methods by large margins. The ablation study further illustrates the contribution of each technique in VFLHLP to its overall performance.

5/22/2024

cs.LG cs.DC

UIFV: Data Reconstruction Attack in Vertical Federated Learning

Jirui Yang, Peng Chen, Zhihui Lu, Qiang Duan, Yubing Bao

Vertical Federated Learning (VFL) facilitates collaborative machine learning without the need for participants to share raw private data. However, recent studies have revealed privacy risks where adversaries might reconstruct sensitive features through data leakage during the learning process. Although data reconstruction methods based on gradient or model information are somewhat effective, they reveal limitations in VFL application scenarios. This is because these traditional methods heavily rely on specific model structures and/or have strict limitations on application scenarios. To address this, our study introduces the Unified InverNet Framework into VFL, which yields a novel and flexible approach (dubbed UIFV) that leverages intermediate feature data to reconstruct original data, instead of relying on gradients or model details. The intermediate feature data is the feature exchanged by different participants during the inference phase of VFL. Experiments on four datasets demonstrate that our methods significantly outperform state-of-the-art techniques in attack precision. Our work exposes severe privacy vulnerabilities within VFL systems that pose real threats to practical VFL applications and thus confirms the necessity of further enhancing privacy protection in the VFL architecture.

6/19/2024

cs.LG cs.AI cs.CR stat.ML