Synthetic Data Outliers: Navigating Identity Disclosure

2406.02736

Published 6/6/2024 by Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz

Abstract

Multiple synthetic data generation models have emerged, among which deep learning models have become the vanguard due to their ability to capture the underlying characteristics of the original data. However, the resemblance of the synthetic to the original data raises important questions on the protection of individuals' privacy. As synthetic data is perceived as a means to fully protect personal information, most current related work disregards the impact of re-identification risk. In particular, limited attention has been given to exploring outliers, despite their privacy relevance. In this work, we analyze the privacy of synthetic data w.r.t the outliers. Our main findings suggest that outliers re-identification via linkage attack is feasible and easily achieved. Furthermore, additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of the data utility.

Overview

This paper explores the risks posed by outliers in synthetic data and their potential to disclose individual identities.
The authors investigate whether outlier records can be re-identified through linkage attacks on synthetic data, and how additional safeguards such as differential privacy affect that risk.
The paper analyzes this challenge and provides insights into the tradeoff between data utility and privacy protection.

Plain English Explanation

Synthetic data is artificial data that is created to mimic the characteristics of real-world data without revealing sensitive information about the individuals involved. This type of data is often used in machine learning and data analysis to protect the privacy of the original data sources.

However, the more closely synthetic data resembles the original data, the more it can reveal about the people behind it. The authors find that outliers, records that are very different from the rest of the data, are especially exposed: an attacker can link them back to individuals with little effort, defeating the purpose of using synthetic data in the first place. Additional safeguards such as differential privacy can prevent this re-identification, but at the expense of data utility.

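Related work helps put these findings in context, including research on attribute inference against models trained on synthetic data ("When Machine Learning Models Leak") and against synthetic data itself ("A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data"). The tradeoff between preserving data utility and protecting individual privacy is examined in "Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control", and broader risks of synthetic data are discussed in "Real Risks of Fake Data". All four are listed under Related Papers.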

Technical Explanation

The paper investigates the privacy risk carried by outliers in synthetic data and the extent to which differential privacy mitigates it. Differential privacy is a formal privacy guarantee, typically achieved by adding calibrated random noise, that limits how much any single individual's record can influence the released output. Synthetic data generators can be trained under this guarantee, usually at some cost to statistical fidelity.
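To make the idea of calibrated noise concrete, here is a minimal sketch of the Laplace mechanism applied to a simple counting query. It is an illustration only, not the generators evaluated in the paper; the function names `laplace_noise` and `private_count` and the toy `ages` list are assumptions made up for this example, and the floating-point sampling shown here is not suitable for production use.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values satisfying predicate."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical toy data: ages of nine people.
ages = [23, 31, 35, 38, 41, 44, 52, 67, 94]
noisy_seniors = private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=random.Random(0))
```

A smaller `epsilon` means a larger noise scale, so the released count hides any single person better but is less accurate.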

The authors conducted a comprehensive analysis to understand the factors that contribute to the generation of synthetic data outliers and their potential to disclose sensitive information about individuals. They explored the impact of different differential privacy parameters, data characteristics, and machine learning model architectures on the prevalence and severity of these outliers.
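One simple way to measure how prevalent outliers are in a numeric column is Tukey's interquartile-range rule. The sketch below is an assumption for illustration; the summary does not state which outlier definition the authors use, and the helper name `tukey_outliers` and the sample `incomes` values are hypothetical.

```python
import statistics

def tukey_outliers(values, k: float = 1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical toy column (income in thousands).
incomes = [10, 11, 12, 12, 13, 13, 14, 15, 95]
flagged = tukey_outliers(incomes)
```

The share of flagged records gives a rough prevalence measure, and their distance from the fences gives a rough severity measure.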

The experiments revealed that the choice of differential privacy parameters, such as the privacy budget and noise addition, can significantly influence the likelihood and magnitude of synthetic data outliers. The authors also found that certain data characteristics, such as high-dimensional or skewed distributions, can exacerbate the problem.
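The link between the privacy budget and the amount of noise can be stated precisely for the standard Laplace mechanism. The formulas below are the textbook definitions, given here as background rather than as the paper's own notation; the symbols $M$, $D$, $D'$, $f$, and $\Delta f$ are introduced for this illustration.

```latex
% \varepsilon-differential privacy for a randomized mechanism M:
% for all datasets D, D' differing in one record and all output sets S
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S]

% Laplace mechanism for a numeric query f with sensitivity \Delta f
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right),
\qquad
\operatorname{sd}\big(M(D) - f(D)\big) = \sqrt{2}\,\frac{\Delta f}{\varepsilon}
```

Halving $\varepsilon$ doubles the standard deviation of the noise, which is the utility cost of a tighter privacy budget.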

Furthermore, the paper examines the relationship between synthetic data outliers and the risk of identity disclosure. The authors demonstrate that outliers can be re-identified through linkage attacks, in which synthetic records are matched against data the attacker already holds, and that this is easily achieved unless additional safeguards such as differential privacy are applied.
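The intuition behind a linkage attack on outliers can be sketched with a nearest-match search over quasi-identifiers. This is a simplified illustration under stated assumptions, not the attack procedure from the paper: the attribute names (`age`, `income_k`, `diagnosis`), the toy records, the distance threshold, and the helper names are all hypothetical, and a realistic attack would at least normalize the features first.

```python
import math

def distance(row, target, keys) -> float:
    """Euclidean distance between two records on the given quasi-identifiers."""
    return math.sqrt(sum((row[k] - target[k]) ** 2 for k in keys))

def candidates(target, synthetic, keys, threshold: float):
    """Return all synthetic rows within threshold of the attacker's target."""
    return [row for row in synthetic if distance(row, target, keys) <= threshold]

# Hypothetical synthetic release: two typical records and one outlier.
synthetic = [
    {"age": 34, "income_k": 52, "diagnosis": "A"},
    {"age": 36, "income_k": 55, "diagnosis": "B"},
    {"age": 91, "income_k": 410, "diagnosis": "C"},
]
keys = ["age", "income_k"]

# What the attacker already knows about two real individuals.
typical_target = {"age": 35, "income_k": 53}
outlier_target = {"age": 90, "income_k": 405}

typical_matches = candidates(typical_target, synthetic, keys, threshold=10)
outlier_matches = candidates(outlier_target, synthetic, keys, threshold=10)
```

In this toy data the typical target matches two synthetic rows, so the attacker cannot tell which one applies, while the outlier target matches exactly one row and its sensitive `diagnosis` value is exposed.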

Critical Analysis

The paper provides a comprehensive and insightful analysis of the challenges posed by synthetic data outliers. The authors have highlighted an important issue that is often overlooked in the literature on differential privacy and synthetic data generation.

One potential limitation of the research is that it focuses primarily on tabular data and may not fully capture the complexities of other data modalities, such as images or text. Additionally, the paper does not explore the impact of different machine learning model architectures or training techniques on the generation of synthetic data outliers.

Further research could investigate the generalizability of the findings to a wider range of data types and machine learning models. It would also be valuable to explore potential mitigation strategies, such as advanced data sanitization techniques or novel synthetic data generation algorithms, that could address the problem of synthetic data outliers while preserving data utility.

Conclusion

This paper contributes to the understanding of the risks associated with synthetic data outliers. The authors highlight a critical challenge in synthetic data generation: synthetic data, often perceived as fully protecting personal information, can still expose outlier records to re-identification, and stronger safeguards such as differential privacy come at a cost to data utility.

The insights provided in this paper are crucial for researchers and practitioners working in the field of data privacy and synthetic data generation. By addressing the issue of synthetic data outliers, the authors have paved the way for the development of more robust and reliable techniques for preserving data utility while protecting individual privacy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control

Tânia Carvalho, Nuno Moniz, Luís Antunes, Nitesh Chawla

Protecting user data privacy can be achieved via many methods, from statistical transformations to generative models. However, all of them have critical drawbacks. For example, creating a transformed data set using traditional techniques is highly time-consuming. Also, recent deep learning-based solutions require significant computational resources in addition to long training phases, and differentially private-based solutions may undermine data utility. In this paper, we propose $\epsilon$-PrivateSMOTE, a technique designed for safeguarding against re-identification and linkage attacks, particularly addressing cases with a high re-identification risk. Our proposal combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases. We demonstrate how $\epsilon$-PrivateSMOTE is capable of achieving competitive results in privacy risk and better predictive performance when compared to multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines. We also show how our method improves time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialised hardware.

4/24/2024

cs.LG cs.CR

When Machine Learning Models Leak: An Exploration of Synthetic Training Data

Manel Slokom, Peter-Paul de Wolf, Martha Larson

We investigate an attack on a machine learning model that predicts whether a person or household will relocate in the next two years, i.e., a propensity-to-move classifier. The attack assumes that the attacker can query the model to obtain predictions and that the marginal distribution of the data on which the model was trained is publicly available. The attack also assumes that the attacker has obtained the values of non-sensitive attributes for a certain number of target individuals. The objective of the attack is to infer the values of sensitive attributes for these target individuals. We explore how replacing the original data with synthetic data when training the model impacts how successfully the attacker can infer sensitive attributes.

5/21/2024

cs.LG

A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data

Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, Luc Rocher

Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn statistical properties of real data in order to generate artificial data that are structurally and statistically similar to sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, but only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this is the result of individual information leakage (as opposed to population-level inference). We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility but at the cost of making attacks far more effective.

5/10/2024

cs.LG cs.CR

Real Risks of Fake Data: Synthetic Data, Diversity-Washing and Consent Circumvention

Cedric Deslandes Whitney, Justin Norman

Machine learning systems require representations of the real world for training and testing - they require data, and lots of it. Collecting data at scale has logistical and ethical challenges, and synthetic data promises a solution to these challenges. Instead of needing to collect photos of real people's faces to train a facial recognition system, a model creator could create and use photo-realistic, synthetic faces. The comparative ease of generating this synthetic data rather than relying on collecting data has made it a common practice. We present two key risks of using synthetic data in model development. First, we detail the high risk of false confidence when using synthetic data to increase dataset diversity and representation. We base this in the examination of a real world use-case of synthetic data, where synthetic datasets were generated for an evaluation of facial recognition technology. Second, we examine how using synthetic data risks circumventing consent for data usage. We illustrate this by considering the importance of consent to the U.S. Federal Trade Commission's regulation of data collection and affected models. Finally, we discuss how these two risks exemplify how synthetic data complicates existing governance and ethical practice; by decoupling data from those it impacts, synthetic data is prone to consolidating power away from those most impacted by algorithmically-mediated harm.

5/6/2024

cs.CY cs.AI cs.CV