A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data

2301.10053

Published 5/10/2024 by Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, Luc Rocher

Abstract

Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn statistical properties of real data in order to generate artificial data that are structurally and statistically similar to sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, but only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this is the result of individual information leakage (as opposed to population-level inference). We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility but at the cost of making attacks far more effective.

Overview

Synthetic data generation (SDG) has been proposed as a way to share sensitive data while protecting privacy.
SDG aims to learn the statistical properties of real data and generate artificial data that is structurally and statistically similar.
However, prior research suggests that inference attacks on synthetic data can undermine privacy, mainly for specific outlier records.
This paper introduces a new attribute inference attack against synthetic data that targets all records, not just outliers.
The attack is based on linear reconstruction methods for aggregate statistics and is evaluated on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms.
The authors define a formal privacy game to show that their attack can be highly accurate on arbitrary records, due to individual information leakage rather than just population-level inference.
They then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility for these SDG methods.

Plain English Explanation

Imagine you have sensitive data, like people's medical records or financial information, that you want to share with researchers or other parties. You can't just share the real data because that would violate people's privacy. So, researchers have been working on a technique called synthetic data generation (SDG) to create fake data that looks and behaves like the real thing, but without revealing any individual's private information.

The idea behind SDG is to analyze the statistical patterns in the real data and then use that information to generate new, artificial data that has similar properties. This synthetic data can then be shared instead of the original sensitive information.
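As a rough illustration of that idea, the sketch below fits the frequency of each value in each column and then samples every column independently. It is a deliberately naive toy, not one of the SDG algorithms evaluated in the paper; the function names `fit_marginals` and `sample_synthetic` and the example records are hypothetical.

```python
import random
from collections import Counter

def fit_marginals(records):
    """Estimate, for each column separately, how often each value occurs."""
    marginals = []
    for col in range(len(records[0])):
        counts = Counter(row[col] for row in records)
        total = sum(counts.values())
        marginals.append({value: c / total for value, c in counts.items()})
    return marginals

def sample_synthetic(marginals, n_records, seed=0):
    """Draw artificial records by sampling every column independently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_records):
        row = tuple(
            rng.choices(list(m.keys()), weights=list(m.values()))[0]
            for m in marginals
        )
        synthetic.append(row)
    return synthetic

# Hypothetical "real" records: (age band, sex, has_condition).
real = [
    ("30-39", "F", "yes"),
    ("40-49", "M", "no"),
    ("30-39", "M", "no"),
    ("50-59", "F", "yes"),
]
marginals = fit_marginals(real)
synthetic = sample_synthetic(marginals, n_records=5)
for row in synthetic:
    print(row)
```

Real SDG methods model dependencies between columns as well, which is what makes their output useful and, as the paper argues, also what can leak information about individuals.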

However, the paper shows that even though the synthetic data may look realistic, there are ways for attackers to figure out information about the individuals in the real data. The researchers introduce a new kind of attack that uses the synthetic data to accurately infer details about any record in the original dataset, not just the outliers or unusual cases.

By defining a formal "privacy game," the researchers demonstrate that this attack works because the synthetic data still leaks information about individual people, even if it looks realistic overall. They then explore how to balance protecting people's privacy with maintaining the usefulness of the synthetic data for analysis and research.
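One way to picture such a game is the simplified sketch below. A challenger sets the target's secret attribute with a fair coin flip, so it cannot be predicted from population-level patterns, then releases synthetic data, and the adversary guesses the secret. Guessing accuracy well above 50% then indicates leakage about the individual. The coin-flip setup, the two toy generators, and every name here are assumptions made for illustration; the paper's formal game may differ in its details.

```python
import random

def play_game(generator, adversary, population, trials=400, seed=0):
    """Return how often the adversary guesses a randomized secret bit."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        data = [dict(r) for r in population]
        target = rng.randrange(len(data))
        secret = rng.randint(0, 1)          # fair coin: no population signal
        data[target]["secret"] = secret
        synthetic = generator(data, rng)
        known = {k: v for k, v in data[target].items() if k != "secret"}
        guess = adversary(synthetic, known, rng)
        correct += int(guess == secret)
    return correct / trials

def copy_generator(data, rng):
    """Worst case: the 'synthetic' data is just a copy of the real data."""
    return [dict(r) for r in data]

def random_secret_generator(data, rng):
    """Keeps the known attributes but replaces every secret with a coin flip."""
    return [{**r, "secret": rng.randint(0, 1)} for r in data]

def lookup_adversary(synthetic, known, rng):
    """Guess the secret of the first synthetic record matching the target."""
    for r in synthetic:
        if all(r.get(k) == v for k, v in known.items()):
            return r["secret"]
    return rng.randint(0, 1)

# Hypothetical population with unique known attributes per person.
population = [{"age": 20 + i, "zip": 1000 + i, "secret": 0} for i in range(30)]
leaky_accuracy = play_game(copy_generator, lookup_adversary, population)
safe_accuracy = play_game(random_secret_generator, lookup_adversary, population)
print(leaky_accuracy, safe_accuracy)
```

The copying generator lets the adversary win every round, while the generator that discards the real secrets leaves the adversary at roughly chance level.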

The paper finds that current SDG methods cannot consistently provide sufficient privacy protection while keeping the synthetic data useful. The best method they tested, an SDG mechanism built on differential privacy, offered both protection and reasonable usefulness, but only in very specific settings. The researchers also show that releasing more synthetic records can improve usefulness, but at the cost of making the attacks far more effective.

Technical Explanation

This paper introduces a new attribute inference attack against synthetic data generated by state-of-the-art algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms.

Unlike prior attacks, which were effective primarily against outlier records, this attack applies linear reconstruction methods to aggregate statistics computed from the synthetic data in order to infer attributes of any record in the original dataset, not just unusual ones. The authors define a formal "privacy game" to show that the attack can achieve high accuracy on arbitrary records, and that this accuracy stems from individual-level information leakage rather than population-level inference.
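The general principle behind linear reconstruction can be sketched as follows: each noisy aggregate count is a linear equation in the unknown secret bits, and enough such equations pin those bits down. This is a minimal, generic sketch under stated assumptions, not the authors' implementation. The random subset queries, the bounded noise, and the use of projected gradient descent on a least-squares objective are all choices made here for illustration, and all function names are hypothetical.

```python
import random

def make_queries(n, m, rng):
    """Draw m random subsets of record indices; each subset is one counting query."""
    return [[i for i in range(n) if rng.random() < 0.5] for _ in range(m)]

def answer_queries(secret, queries, noise, rng):
    """Noisy answer to 'how many records in this subset have secret = 1?'."""
    return [sum(secret[i] for i in q) + rng.uniform(-noise, noise) for q in queries]

def reconstruct(queries, answers, n, steps=1000):
    """Fit x in [0, 1]^n to the noisy counts, then round each entry to a bit."""
    x = [0.5] * n
    # Step size of 1 / trace(A^T A), which never exceeds 1 / (largest eigenvalue).
    lr = 1.0 / max(1, sum(len(q) for q in queries))
    for _ in range(steps):
        grad = [0.0] * n
        for q, b in zip(queries, answers):
            residual = sum(x[i] for i in q) - b
            for i in q:
                grad[i] += residual
        # Gradient step followed by projection back onto the box [0, 1]^n.
        x = [min(1.0, max(0.0, xi - lr * g)) for xi, g in zip(x, grad)]
    return [1 if xi >= 0.5 else 0 for xi in x]

rng = random.Random(0)
n, m = 16, 80                      # 16 records, 80 aggregate queries
secret = [rng.randint(0, 1) for _ in range(n)]
queries = make_queries(n, m, rng)
answers = answer_queries(secret, queries, noise=0.5, rng=rng)
guess = reconstruct(queries, answers, n)
accuracy = sum(g == s for g, s in zip(guess, secret)) / n
print(f"fraction of secret bits recovered: {accuracy:.2f}")
```

In the synthetic data setting, the noisy answers would be estimated from the released synthetic records rather than obtained from a query interface.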

The researchers then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility for these SDG algorithms. They find that current methods cannot consistently provide sufficient privacy protection against their inference attack while also retaining reasonable utility.

The differentially private SDG mechanism evaluated performs the best, providing both privacy protection and reasonable utility, but only in very specific settings. The authors also show that generating a larger number of synthetic records can improve utility but significantly increases the effectiveness of their inference attack.
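One plausible intuition for this effect, offered here as an assumption rather than a result taken from the paper, is that statistics estimated from a larger synthetic sample carry less sampling noise. That helps legitimate analysts and attackers alike. The sketch below measures how the error of an estimated rate shrinks as more records are sampled; the function name `marginal_error` and the rate of 0.3 are hypothetical.

```python
import random

def marginal_error(true_rate, n_synthetic, trials, rng):
    """Mean absolute error of a rate estimated from n_synthetic sampled records."""
    total = 0.0
    for _ in range(trials):
        ones = sum(1 for _ in range(n_synthetic) if rng.random() < true_rate)
        total += abs(ones / n_synthetic - true_rate)
    return total / trials

rng = random.Random(0)
errors = {n: marginal_error(0.3, n, trials=20, rng=rng) for n in (100, 1000, 10000)}
for n, err in errors.items():
    print(f"{n:>6} synthetic records: mean abs error {err:.4f}")
```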

Critical Analysis

The paper makes a valuable contribution by introducing a new, powerful attribute inference attack that targets all records in the original dataset through the synthetic data released in its place, not just outliers. This represents an important advance over prior attacks, which were limited to specific types of records.

However, the authors acknowledge several limitations and areas for further research. For example, they note that their attack assumes access to the full synthetic dataset, which may not always be the case in real-world settings. Exploring attacks with partial access or other adversarial assumptions could provide additional insights.

Additionally, while the differentially private SDG mechanism shows promise, the authors suggest that its utility-privacy tradeoffs may still be insufficient for many practical applications. Developing more advanced privacy-preserving SDG techniques, perhaps drawing inspiration from recent work in this area, could help address this challenge.

It would also be valuable to investigate the real-world implications and potential societal impacts of these privacy attacks on synthetic data. Understanding the broader context and consequences of such vulnerabilities is crucial for informing the responsible development and deployment of SDG systems.

Conclusion

This paper highlights the significant challenge of achieving both strong privacy protection and high statistical utility when generating synthetic data. The introduction of a new attribute inference attack that can accurately target all records, not just outliers, underscores the need for more robust and effective SDG methods.

While differentially private approaches show promise, the authors' findings suggest that current techniques may still fall short of providing the necessary balance between privacy and utility. Continued research and innovation in this area will be crucial for unlocking the full potential of synthetic data to enable data sharing and analysis while safeguarding individual privacy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Data Reconstruction Attacks and Defenses: A Systematic Evaluation

Sheng Liu, Zihan Wang, Yuxiao Chen, Qi Lei

Reconstruction attacks and defenses are essential in understanding the data leakage problem in machine learning. However, prior work has centered around empirical observations of gradient inversion attacks, lacks theoretical justifications, and cannot disentangle the usefulness of defending methods from the computational limitation of attacking methods. In this work, we propose to view the problem as an inverse problem, enabling us to theoretically, quantitatively, and systematically evaluate the data reconstruction problem. On various defense methods, we derived the algorithmic upper bound and the matching (in feature dimension and model width) information-theoretical lower bound on the reconstruction error for two-layer neural networks. To complement the theoretical results and investigate the utility-privacy trade-off, we defined a natural evaluation metric of the defense methods with similar utility loss among the strongest attacks. We further propose a strong reconstruction attack that helps update some previous understanding of the strength of defense methods under our proposed evaluation metric.

6/28/2024

cs.CR cs.LG

Synthetic Data Outliers: Navigating Identity Disclosure

Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz

Multiple synthetic data generation models have emerged, among which deep learning models have become the vanguard due to their ability to capture the underlying characteristics of the original data. However, the resemblance of the synthetic to the original data raises important questions on the protection of individuals' privacy. As synthetic data is perceived as a means to fully protect personal information, most current related work disregards the impact of re-identification risk. In particular, limited attention has been given to exploring outliers, despite their privacy relevance. In this work, we analyze the privacy of synthetic data w.r.t the outliers. Our main findings suggest that outliers re-identification via linkage attack is feasible and easily achieved. Furthermore, additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of the data utility.

6/6/2024

cs.LG cs.CR

The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data

Alexander Decruyenaere, Heidelinde Dehaene, Paloma Rabaey, Christiaan Polet, Johan Decruyenaere, Stijn Vansteelandt, Thomas Demeester

Recent advances in generative models facilitate the creation of synthetic data to be made available for research in privacy-sensitive contexts. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data, whereby synthetic data are treated as if they were actually observed. Before publishing synthetic data, it is essential to develop statistical inference tools for such data. By means of a simulation study, we show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased. Despite the use of a previously proposed correction factor, this problem persists for deep generative models, in part due to slower convergence of estimators and resulting underestimation of the true standard error. We further demonstrate our findings through a case study.

6/13/2024

cs.LG stat.ML

When Machine Learning Models Leak: An Exploration of Synthetic Training Data

Manel Slokom, Peter-Paul de Wolf, Martha Larson

We investigate an attack on a machine learning model that predicts whether a person or household will relocate in the next two years, i.e., a propensity-to-move classifier. The attack assumes that the attacker can query the model to obtain predictions and that the marginal distribution of the data on which the model was trained is publicly available. The attack also assumes that the attacker has obtained the values of non-sensitive attributes for a certain number of target individuals. The objective of the attack is to infer the values of sensitive attributes for these target individuals. We explore how replacing the original data with synthetic data when training the model impacts how successfully the attacker can infer sensitive attributes.

5/21/2024

cs.LG