Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control

2212.00484

Published 4/24/2024 by Tânia Carvalho, Nuno Moniz, Luís Antunes, Nitesh Chawla

Abstract

Protecting user data privacy can be achieved via many methods, from statistical transformations to generative models. However, all of them have critical drawbacks. For example, creating a transformed data set using traditional techniques is highly time-consuming. Also, recent deep learning-based solutions require significant computational resources in addition to long training phases, and differentially private-based solutions may undermine data utility. In this paper, we propose $\epsilon$-PrivateSMOTE, a technique designed for safeguarding against re-identification and linkage attacks, particularly addressing cases with a high re-identification risk. Our proposal combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases. We demonstrate how $\epsilon$-PrivateSMOTE is capable of achieving competitive results in privacy risk and better predictive performance when compared to multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines. We also show how our method improves time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialised hardware.

Overview

The paper proposes a new technique called $\epsilon$-PrivateSMOTE for generating synthetic data that protects against re-identification and linkage attacks.
The method combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases.
The authors demonstrate that $\epsilon$-PrivateSMOTE achieves competitive results in privacy risk and better predictive performance compared to traditional and state-of-the-art privacy-preservation methods.
The technique also improves time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialized hardware.

Plain English Explanation

Protecting people's private data is a critical challenge. Traditional techniques for transforming data to protect privacy can be highly time-consuming. Recent deep learning-based solutions require significant computing power and long training phases, while solutions based on differential privacy may undermine the usefulness of the data.

The researchers propose a new method called $\epsilon$-PrivateSMOTE that aims to address these issues. It generates synthetic data by interpolating between existing data points and adding noise, creating new points that resemble the original data but with added privacy protection. This is combined with principles from differential privacy to obfuscate high-risk cases, where a person's identity could be revealed.

The researchers show that $\epsilon$-PrivateSMOTE can achieve competitive privacy protection while maintaining the usefulness of the data for tasks like making predictions. Importantly, it is also at least nine times faster than other approaches and needs no specialized hardware. This makes it a practical solution for organizations that need to protect people's privacy while still being able to use the data for valuable applications.

Technical Explanation

The proposed $\epsilon$-PrivateSMOTE technique combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases and protect against re-identification and linkage attacks.

The method works by first identifying high-risk data points that are prone to re-identification. It then generates new synthetic data points by interpolating between existing data points and adding noise. This creates new points that resemble the original data but are less susceptible to attacks.
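The two steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not say how high-risk cases are identified, so the helper `flag_high_risk` (a hypothetical name) assumes a case is high-risk when its combination of quasi-identifier values occurs fewer than `k` times. The helper `noisy_interpolation` (also hypothetical) assumes a SMOTE-style step toward a nearby record, with the interpolation factor drawn from a Laplace distribution of scale `1/epsilon`. The sketch makes no claim of a formal differential privacy guarantee.

```python
from collections import Counter

import numpy as np


def flag_high_risk(quasi_identifiers, k=3):
    """Return a boolean mask of rows whose quasi-identifier combination
    occurs fewer than k times (assumed risk criterion, not from the source)."""
    keys = [tuple(row) for row in np.asarray(quasi_identifiers)]
    sizes = Counter(keys)
    return np.array([sizes[key] < k for key in keys], dtype=bool)


def noisy_interpolation(X, high_risk, epsilon=1.0, n_neighbors=3, seed=0):
    """Create one synthetic point per high-risk row by stepping from the row
    toward one of its nearest neighbours, with a Laplace-distributed step.

    Assumes X has at least two rows and only numeric columns.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_candidates = min(n_neighbors, len(X) - 1)
    synthetic = []
    for i in np.flatnonzero(high_risk):
        # Euclidean distance from row i to every row; exclude row i itself.
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf
        neighbours = np.argsort(dist)[:n_candidates]
        j = rng.choice(neighbours)
        # Plain SMOTE draws the step uniformly from [0, 1]; here (assumption)
        # it is Laplace noise, so a smaller epsilon gives a wider spread.
        step = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=X.shape[1])
        synthetic.append(X[i] + step * (X[j] - X[i]))
    return np.array(synthetic).reshape(-1, X.shape[1])
```

Under these assumptions, the synthetic rows would replace the flagged originals in the released data set, and a smaller `epsilon` pushes the replacements further from the records they were derived from.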

The technique also incorporates differential privacy to further enhance the privacy protection. Differential privacy is a mathematical framework that quantifies the privacy risk of disclosing information about individuals in a dataset.
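For reference, the standard definition of $\epsilon$-differential privacy can be written as follows. The summary does not state it, so the symbols $\mathcal{M}$ (a randomized mechanism), $D$ and $D'$ (data sets differing in one record), $S$ (a set of outputs), $f$ (a query) and $\Delta f$ (its sensitivity) are notation introduced here for illustration. The second line is the classic Laplace mechanism, shown only as the usual way noise is calibrated to $\epsilon$, not as the exact mechanism used in the paper.

```latex
% epsilon-differential privacy: for all neighbouring data sets D, D'
% and every set of outputs S,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S]

% Laplace mechanism for a query f with sensitivity \Delta f:
\mathcal{M}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\epsilon}\right),
\qquad
\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1
```

A smaller $\epsilon$ means the two output distributions must be closer, which gives stronger privacy at the cost of more noise.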

The authors demonstrate that $\epsilon$-PrivateSMOTE achieves competitive privacy risk and better predictive performance than multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines.

Additionally, the researchers show that $\epsilon$-PrivateSMOTE is a resource-efficient solution that improves time requirements by at least a factor of 9 compared to other approaches, without requiring specialized hardware. This makes it a practical and scalable option for organizations that need to protect sensitive data.

Critical Analysis

The paper provides a promising approach for generating synthetic data that protects against re-identification and linkage attacks. The combination of noise-induced interpolation and differential privacy principles appears to be an effective way to balance data utility and privacy protection.

However, the authors acknowledge that their method may still have some limitations. For example, the performance of $\epsilon$-PrivateSMOTE may depend on the specific dataset and the characteristics of the high-risk data points. Further research may be needed to understand the robustness of the technique across a wider range of scenarios.

Additionally, the paper does not explore the potential for adapting diffusion models to private data generation, which could be an interesting area for future work. Diffusion models have shown promising results in synthetic data generation, and incorporating differential privacy principles could further enhance their privacy-preserving capabilities.

Overall, the $\epsilon$-PrivateSMOTE technique represents a valuable contribution to the field of privacy-preserving data generation. Its efficiency and effectiveness in maintaining data utility while protecting against re-identification make it a promising solution for organizations that need to balance these competing priorities.

Conclusion

The paper presents a novel technique called $\epsilon$-PrivateSMOTE that combines synthetic data generation with differential privacy principles to protect against re-identification and linkage attacks. The authors demonstrate that this approach achieves competitive results in privacy risk and better predictive performance compared to traditional and state-of-the-art privacy-preservation methods.

Importantly, $\epsilon$-PrivateSMOTE is a resource-efficient solution that significantly improves time requirements and can be implemented without specialized hardware. This makes it a practical and scalable option for organizations that need to protect sensitive data while still being able to use it for valuable applications.

Overall, the $\epsilon$-PrivateSMOTE technique represents an important step forward in the field of privacy-preserving data generation, offering a balanced and efficient solution that can help safeguard people's private information while preserving the utility of the data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Differentially Private Synthetic Data with Private Density Estimation

Nikolija Bojkovic, Po-Ling Loh

The need to analyze sensitive data, such as medical records or financial data, has created a critical research challenge in recent years. In this paper, we adopt the framework of differential privacy, and explore mechanisms for generating an entire dataset which accurately captures characteristics of the original data. We build upon the work of Boedihardjo et al, which laid the foundations for a new optimization-based algorithm for generating private synthetic data. Importantly, we adapt their algorithm by replacing a uniform sampling step with a private distribution estimator; this allows us to obtain better computational guarantees for discrete distributions, and develop a novel algorithm suitable for continuous distributions. We also explore applications of our work to several statistical tasks.

5/9/2024

cs.CR cs.IT cs.LG stat.ML

Synthetic Data Outliers: Navigating Identity Disclosure

Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz

Multiple synthetic data generation models have emerged, among which deep learning models have become the vanguard due to their ability to capture the underlying characteristics of the original data. However, the resemblance of the synthetic to the original data raises important questions on the protection of individuals' privacy. As synthetic data is perceived as a means to fully protect personal information, most current related work disregards the impact of re-identification risk. In particular, limited attention has been given to exploring outliers, despite their privacy relevance. In this work, we analyze the privacy of synthetic data w.r.t the outliers. Our main findings suggest that outliers re-identification via linkage attack is feasible and easily achieved. Furthermore, additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of the data utility.

6/6/2024

cs.LG cs.CR

PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining

Kecen Li, Chen Gong, Zhixiang Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang

Differential Privacy (DP) image data synthesis leverages the DP technique to generate synthetic data to replace the sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns. Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data, but suffer from problems of unstable training and massive computational resource demands. This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data, promoting the efficient creation of DP datasets with high fidelity and utility. PRIVIMAGE first establishes a semantic query function using a public dataset. Then, this function assists in querying the semantic distribution of the sensitive dataset, facilitating the selection of data from the public dataset with analogous semantics for pre-training. Finally, we pre-train an image generative model using the selected data and then fine-tune this model on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD). PRIVIMAGE allows us to train a lightly parameterized generative model, reducing the noise in the gradient during DP-SGD training and enhancing training stability. Extensive experiments demonstrate that PRIVIMAGE uses only 1% of the public dataset for pre-training and 7.6% of the parameters in the generative model compared to the state-of-the-art method, while achieving superior synthetic performance and conserving more computational resources. On average, PRIVIMAGE achieves 30.1% lower FID and 12.6% higher Classification Accuracy than the state-of-the-art method. The replication package and datasets can be accessed online.

4/16/2024

cs.CV cs.CR cs.LG

Differentially Private Fine-Tuning of Diffusion Models

Yu-Lin Tsai, Yizhe Li, Zekai Chen, Po-Yu Chen, Chia-Mu Yu, Xuebin Ren, Francois Buet-Golfouse

The integration of Differential Privacy (DP) with diffusion models (DMs) presents a promising yet challenging frontier, particularly due to the substantial memorization capabilities of DMs that pose significant privacy risks. Differential privacy offers a rigorous framework for safeguarding individual data points during model training, with Differential Privacy Stochastic Gradient Descent (DP-SGD) being a prominent implementation. The diffusion method decomposes image generation into iterative steps, theoretically aligning well with DP's incremental noise addition. Despite the natural fit, the unique architecture of DMs necessitates tailored approaches to effectively balance the privacy-utility trade-off. Recent developments in this field have highlighted the potential for generating high-quality synthetic data by pre-training on public data (e.g., ImageNet) and fine-tuning on private data. However, there is a pronounced gap in research on optimizing the trade-offs involved in DP settings, particularly concerning parameter efficiency and model scalability. Our work addresses this by proposing a parameter-efficient fine-tuning strategy optimized for private diffusion models, which minimizes the number of trainable parameters to enhance the privacy-utility trade-off. We empirically demonstrate that our method achieves state-of-the-art performance in DP synthesis, significantly surpassing previous benchmarks on widely studied datasets (e.g., with only 0.47M trainable parameters, achieving a more than 35% improvement over the previous state-of-the-art with a small privacy budget on the CelebA-64 dataset). Anonymous codes available at https://anonymous.4open.science/r/DP-LORA-F02F.

6/4/2024

cs.CV cs.AI cs.CR