Towards Split Learning-based Privacy-Preserving Record Linkage

Read original: arXiv:2409.01088 - Published 9/4/2024 by Michail Zervas, Alexandros Karakasidis

Overview

This paper explores a privacy-preserving record linkage method using split learning.
Split learning allows models to be trained collaboratively without sharing raw data.
The approach aims to link records across databases while preserving data privacy.

Plain English Explanation

The paper presents a new way to link records across different databases without compromising people's privacy. Traditional record linkage methods often require sharing raw data, which can be a privacy concern.

Instead, this approach uses split learning, a technique where the model is split into two parts. One part runs on the data owner's side, and the other runs on a remote server. This allows the model to be trained collaboratively without exposing the raw data.

The key idea is to use support vector machines (SVMs) as the machine learning algorithm and to adapt them for split learning. According to the paper's abstract, training relies on Reference Sets, which are publicly available data corpora, so that the parties can link records across databases without exposing their own raw data.

Technical Explanation

The paper proposes a privacy-preserving record linkage method based on split learning, in which machine learning models are trained collaboratively without the parties directly sharing raw data.

The underlying algorithm is a support vector machine (SVM), adapted so that the model is divided into two parts: one runs on the data owner's side and the other on a remote server. Training then proceeds without the raw data leaving its owner.
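The two-part arrangement described above can be sketched in a few lines of NumPy. This is a minimal illustration of generic split learning with a hinge-loss (linear SVM) head, not the paper's algorithm. The synthetic data, the layer sizes, the learning rate, the choice to let the server hold the labels, and the names `client_forward`, `server_step`, and `train` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for record-pair feature vectors (assumption).
# Label +1 means "match", -1 means "non-match".
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

# Client-side part: a linear layer that turns raw features into activations.
W_client = rng.normal(scale=0.1, size=(4, 3))
# Server-side part: a linear SVM trained with hinge loss on the activations.
w_server = np.zeros(3)
b_server = 0.0


def client_forward(X, W):
    """Runs on the data owner's side; only the result is sent to the server."""
    return X @ W


def server_step(A, y, w, b, lr, reg):
    """One hinge-loss subgradient step on the server.

    Returns the updated server parameters and the gradient of the loss
    with respect to the activations, which is sent back to the client.
    """
    n = len(y)
    margins = y * (A @ w + b)
    active = (margins < 1).astype(float)  # samples inside the margin
    grad_w = reg * w - (active * y) @ A / n
    grad_b = -np.sum(active * y) / n
    grad_A = -(active * y)[:, None] * w[None, :] / n
    return w - lr * grad_w, b - lr * grad_b, grad_A


def train(X, y, W, w, b, epochs=200, lr=0.5, reg=1e-3):
    for _ in range(epochs):
        A = client_forward(X, W)                          # client to server
        w, b, grad_A = server_step(A, y, w, b, lr, reg)   # server to client
        W = W - lr * (X.T @ grad_A)                       # client-side update
    return W, w, b


W_client, w_server, b_server = train(X, y, W_client, w_server, b_server)
pred = np.sign(client_forward(X, W_client) @ w_server + b_server)
accuracy = float(np.mean(pred == y))
```

In this sketch the raw matrix `X` is only ever touched by `client_forward` and the client-side update; the server sees activations and labels. Whether labels stay with the client or go to the server is a design choice that varies between split learning setups.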

The paper provides a detailed description of the split learning-based record linkage process. It includes the algorithm design, data preparation, and model training steps. Experiments are conducted to evaluate the performance and privacy properties of the proposed method.
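To make the data preparation step more concrete, here is a hedged sketch of one common way record pairs are turned into numeric vectors for an SVM: a per-field string similarity. The bigram Dice coefficient, the field names, and the helper names `bigrams`, `dice`, and `pair_features` are assumptions chosen for illustration; the summary does not say which features the paper uses. The code compares strings in the clear, so it shows only feature construction, not the privacy protection.

```python
def bigrams(s):
    """Return the set of character bigrams of a normalized string."""
    s = s.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient over character bigrams, in the range [0, 1]."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower().strip() == b.lower().strip() else 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def pair_features(record_1, record_2, fields):
    """Build one similarity value per field for a candidate record pair."""
    return [dice(record_1[f], record_2[f]) for f in fields]


# Hypothetical records and field names, for illustration only.
r1 = {"first_name": "John", "last_name": "Smith"}
r2 = {"first_name": "John", "last_name": "Smyth"}
features = pair_features(r1, r2, ["first_name", "last_name"])
```

Each resulting vector would then be labeled as a match or non-match and passed to the classifier.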

The results show that the split learning-based approach achieves record linkage quality comparable to a traditional centralized SVM-based technique (the abstract describes the matching impact as minimal) while preserving data privacy. The paper also discusses potential limitations and areas for future research, such as extending the method to handle more complex data types and exploring alternative machine learning algorithms for split learning.

Critical Analysis

The paper presents a promising approach for privacy-preserving record linkage using split learning. By leveraging the capabilities of support vector machines, the method can link records across databases without directly sharing raw data. This is a valuable contribution, as traditional record linkage techniques often raise privacy concerns.

One potential limitation is the reliance on SVMs, which may not be the optimal machine learning algorithm for all types of data and linkage tasks. The paper acknowledges this and suggests exploring alternative algorithms, such as deep learning models, in future work.

Moreover, the paper focuses on a specific split learning approach and does not provide a comprehensive comparison to other privacy-preserving record linkage techniques. Assessing the method's performance and privacy guarantees relative to other state-of-the-art approaches could further strengthen the research.

Despite these minor considerations, the paper makes a significant step forward in addressing the privacy challenges associated with record linkage. The split learning-based method offers a practical solution that balances the need for accurate linkage and the protection of sensitive data.

Conclusion

This paper introduces a novel privacy-preserving record linkage method based on split learning. By adapting support vector machines for a split learning framework, the approach enables secure linkage of records across databases without directly sharing raw data.

The experimental results demonstrate the effectiveness of the proposed technique in terms of linkage accuracy and privacy preservation. While further research is needed to explore alternative machine learning algorithms and compare the method with other state-of-the-art approaches, this work represents an important contribution to the field of privacy-preserving data integration.

The split learning-based record linkage method has the potential to facilitate valuable data-driven insights while respecting individual privacy, which is a critical concern in many domains. As data sharing and integration become increasingly important, this research provides a promising direction for developing privacy-preserving solutions.

Related Papers

Towards Split Learning-based Privacy-Preserving Record Linkage

Michail Zervas, Alexandros Karakasidis

Split Learning has been recently introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different dataholders, but without disclosing any additional information. In this paper, we investigate the potentials of Split Learning for Privacy-Preserving Record Matching, by introducing a novel training method through the utilization of Reference Sets, which are publicly available data corpora, showcasing minimal matching impact against a traditional centralized SVM-based technique.

9/4/2024

Split Learning without Local Weight Sharing to Enhance Client-side Data Privacy

Ngoc Duy Pham, Tran Khoa Phan, Alsharif Abuadbba, Yansong Gao, Doan Nguyen, Naveen Chilamkurti

Split learning (SL) aims to protect user data privacy by distributing deep models between client-server and keeping private data locally. In SL training with multiple clients, the local model weights are shared among the clients for local model update. This paper first reveals data privacy leakage exacerbated from local weight sharing among the clients in SL through model inversion attacks. Then, to reduce the data privacy leakage issue, we propose and analyze privacy-enhanced SL (P-SL) (or SL without local weight sharing). We further propose parallelized P-SL to expedite the training process by duplicating multiple server-side model instances without compromising accuracy. Finally, we explore P-SL with late participating clients and devise a server-side cache-based training method to address the forgetting phenomenon in SL when late clients join. Experimental results demonstrate that P-SL helps reduce up to 50% of client-side data leakage, which essentially achieves a better privacy-accuracy trade-off than the current trend by using differential privacy mechanisms. Moreover, P-SL and its cache-based version achieve comparable accuracy to baseline SL under various data distributions, while cost less computation and communication. Additionally, caching-based training in P-SL mitigates the negative effect of forgetting, stabilizes the learning, and enables practical and low-complexity training in a dynamic environment with late-arriving clients.

7/23/2024

Enhancing Privacy in ControlNet and Stable Diffusion via Split Learning

Dixi Yao

With the emerging trend of large generative models, ControlNet is introduced to enable users to fine-tune pre-trained models with their own data for various use cases. A natural question arises: how can we train ControlNet models while ensuring users' data privacy across distributed devices? Exploring different distributed training schemes, we find conventional federated learning and split learning unsuitable. Instead, we propose a new distributed learning structure that eliminates the need for the server to send gradients back. Through a comprehensive evaluation of existing threats, we discover that in the context of training ControlNet with split learning, most existing attacks are ineffective, except for two mentioned in previous literature. To counter these threats, we leverage the properties of diffusion models and design a new timestep sampling policy during forward processes. We further propose a privacy-preserving activation function and a method to prevent private text prompts from leaving clients, tailored for image generation with diffusion models. Our experimental results demonstrate that our algorithms and systems greatly enhance the efficiency of distributed training for ControlNet while ensuring users' data privacy without compromising image generation quality.

9/16/2024

Make Split, not Hijack: Preventing Feature-Space Hijacking Attacks in Split Learning

Tanveer Khan, Mindaugas Budzys, Antonis Michalas

The popularity of Machine Learning (ML) makes the privacy of sensitive data more imperative than ever. Collaborative learning techniques like Split Learning (SL) aim to protect client data while enhancing ML processes. Though promising, SL has been proved to be vulnerable to a plethora of attacks, thus raising concerns about its effectiveness on data privacy. In this work, we introduce a hybrid approach combining SL and Function Secret Sharing (FSS) to ensure client data privacy. The client adds a random mask to the activation map before sending it to the servers. The servers cannot access the original function but instead work with shares generated using FSS. Consequently, during both forward and backward propagation, the servers cannot reconstruct the client's raw data from the activation map. Furthermore, through visual invertibility, we demonstrate that the server is incapable of reconstructing the raw image data from the activation map when using FSS. It enhances privacy by reducing privacy leakage compared to other SL-based approaches where the server can access client input information. Our approach also ensures security against feature space hijacking attack, protecting sensitive information from potential manipulation. Our protocols yield promising results, reducing communication overhead by over 2x and training time by over 7x compared to the same model with FSS, without any SL. Also, we show that our approach achieves >96% accuracy and remains equivalent to the plaintext models.

4/16/2024