Synthetic Census Data Generation via Multidimensional Multiset Sum

2404.10095

Published 4/17/2024 by Cynthia Dwork, Kristjan Greenewald, Manish Raghavan

Abstract

The US Decennial Census provides valuable data for both research and policy purposes. Census data are subject to a variety of disclosure avoidance techniques prior to release in order to preserve respondent confidentiality. While many are interested in studying the impacts of disclosure avoidance methods on downstream analyses, particularly with the introduction of differential privacy in the 2020 Decennial Census, these efforts are limited by a critical lack of data: The underlying microdata, which serve as necessary input to disclosure avoidance methods, are kept confidential. In this work, we aim to address this limitation by providing tools to generate synthetic microdata solely from published Census statistics, which can then be used as input to any number of disclosure avoidance algorithms for the sake of evaluation and carrying out comparisons. We define a principled distribution over microdata given published Census statistics and design algorithms to sample from this distribution. We formulate synthetic data generation in this context as a knapsack-style combinatorial optimization problem and develop novel algorithms for this setting. While the problem we study is provably hard, we show empirically that our methods work well in practice, and we offer theoretical arguments to explain our performance. Finally, we verify that the data we produce are close to the desired ground truth.

Overview

This paper presents tools for generating synthetic census microdata solely from published Census statistics, so that researchers can evaluate and compare disclosure avoidance methods without access to the confidential originals.
The authors define a distribution over microdata consistent with the published statistics and cast sampling from it as a knapsack-style combinatorial optimization problem, the Multidimensional Multiset Sum (MMS) of the title.
Although the problem is provably hard, the authors report that their algorithms work well in practice, offer theoretical arguments for that performance, and verify that the generated data are close to the desired ground truth.

Plain English Explanation

Census data, which contain information about the population, are valuable to researchers and policymakers. To protect respondents, the Census Bureau applies disclosure avoidance techniques before release and keeps the underlying individual-level records (microdata) confidential. That makes it hard for outside researchers to study how those techniques affect downstream analyses. This paper introduces tools that generate synthetic microdata using only the statistics the Census has already published.

The key idea is to treat a population as a multiset: a collection in which each combination of attributes (e.g., age, sex, race) appears as many times as there are records with that combination. The authors define a distribution over the multisets that are consistent with the published Census statistics and develop algorithms to sample from it, so that the synthetic records agree with the published figures.

The resulting synthetic microdata can be fed into any disclosure avoidance algorithm, including the differential privacy approach introduced for the 2020 Decennial Census, to evaluate and compare how such methods affect downstream analyses. In these experiments the synthetic data stand in for the confidential records; the abstract does not present the method as a privacy mechanism in its own right.

Technical Explanation

The authors formulate synthetic census data generation as a knapsack-style combinatorial optimization problem. Microdata are modeled as a multiset, where each element is a combination of attribute values (e.g., age, sex, race) and its multiplicity is the number of records with that combination. The published Census statistics constrain what the multiset must add up to.
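To make the multiset view concrete, here is a minimal Python sketch. The attribute layout, the record values, and the `marginal` helper are all hypothetical and chosen only for illustration; they are not taken from the paper or from any Census product.

```python
from collections import Counter

# Hypothetical records: (age_group, sex, race_code). The multiplicity of each
# key is the number of people sharing that combination of attribute values.
microdata = Counter({
    ("18-64", "F", "A"): 3,
    ("18-64", "M", "A"): 2,
    ("65+", "F", "B"): 1,
})

def marginal(multiset, index):
    """Aggregate the multiset's counts over one attribute position."""
    totals = Counter()
    for record, count in multiset.items():
        totals[record[index]] += count
    return totals

# Aggregate statistics of the kind a statistical agency might publish.
total_population = sum(microdata.values())
by_sex = marginal(microdata, 1)
```

Published tables resemble `total_population` and `by_sex`: they are aggregates of the multiset, and many different multisets can produce the same aggregates.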

The resulting Multidimensional Multiset Sum (MMS) problem asks for a multiset of records whose attribute vectors sum to the published statistics. The authors define a principled distribution over the microdata consistent with those statistics and design algorithms to sample from it. Although the problem is provably hard, they offer theoretical arguments for why their algorithms perform well in practice.
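One way to read the "multidimensional multiset sum" of the title is as follows: given a list of candidate integer vectors and a target vector, choose how many copies of each candidate to use so that the copies sum exactly to the target. The sketch below is a plain backtracking enumerator written for illustration. It is not the authors' algorithm, and the household types, the target, and the uniform sampling step at the end are all assumptions.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Vector = Tuple[int, ...]

def multiset_sum_solutions(items: Sequence[Vector], target: Vector) -> Iterator[Tuple[int, ...]]:
    """Yield every tuple of multiplicities m with sum_i m[i] * items[i] == target.

    Assumes non-negative integer entries and that each item has at least one
    positive entry (otherwise the number of solutions could be infinite).
    """
    if any(all(v == 0 for v in item) for item in items):
        raise ValueError("every item needs at least one positive entry")

    def recurse(i: int, remaining: Vector, chosen: List[int]) -> Iterator[Tuple[int, ...]]:
        if i == len(items):
            if all(r == 0 for r in remaining):
                yield tuple(chosen)
            return
        item = items[i]
        # Largest multiplicity that does not overshoot any coordinate.
        cap = min(r // v for r, v in zip(remaining, item) if v > 0)
        for m in range(cap + 1):
            rest = tuple(r - m * v for r, v in zip(remaining, item))
            yield from recurse(i + 1, rest, chosen + [m])

    yield from recurse(0, tuple(target), [])

# Hypothetical household types as (households, adults, children) vectors.
household_types = [(1, 1, 0), (1, 2, 0), (1, 1, 1), (1, 2, 2)]
# Hypothetical published block totals: 3 households, 4 adults, 2 children.
block_totals = (3, 4, 2)

solutions = list(multiset_sum_solutions(household_types, block_totals))
# Uniform choice is an assumption; the paper defines its own distribution.
sampled = random.choice(solutions)
```

Exhaustive enumeration like this grows quickly with the number of candidate vectors and the size of the totals, which is consistent with the abstract's remark that the problem is provably hard.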

Empirically, the authors show that their methods work well in practice, and they verify that the data they produce are close to the desired ground truth.
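The text here does not say how closeness to the ground truth is measured. As a hedged illustration only, one common choice for comparing two multisets of record types is the total variation distance between their normalized frequencies. The function name and the toy inputs below are hypothetical.

```python
from collections import Counter

def total_variation(a: Counter, b: Counter) -> float:
    """Total variation distance between the normalized frequencies of two multisets."""
    total_a = sum(a.values())
    total_b = sum(b.values())
    keys = set(a) | set(b)
    # Counter returns 0 for missing keys, so types absent from one side count fully.
    return 0.5 * sum(abs(a[k] / total_a - b[k] / total_b) for k in keys)

# Toy multisets of household types (labels are placeholders).
synthetic = Counter({"type_x": 2, "type_y": 2})
reference = Counter({"type_x": 3, "type_y": 1})
distance = total_variation(synthetic, reference)
```

A distance of 0 means the two multisets have identical type frequencies, and 1 means they share no types at all.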

Critical Analysis

The paper presents a carefully motivated approach for generating synthetic census microdata that are consistent with published statistics. The method rests on modeling the microdata as a multidimensional multiset constrained by those statistics, which may not carry over to more complex or heterogeneous datasets.

Additionally, the underlying problem is provably hard, so worst-case scalability remains a concern even though the authors report good empirical performance. Further research is needed to understand how far the method scales and to explore its applicability to a wider range of data types and scenarios.

Despite these potential limitations, the paper makes a useful contribution to the study of disclosure avoidance by addressing a practical obstacle: the confidential microdata that such methods take as input are not available to outside researchers. The combination of empirical results and theoretical arguments gives future work on evaluating these methods a concrete starting point.

Conclusion

The paper frames synthetic census data generation as a Multidimensional Multiset Sum (MMS) problem: produce microdata that are consistent with published Census statistics. By defining a distribution over such microdata and designing algorithms to sample from it, the authors produce synthetic data that can serve as input to disclosure avoidance algorithms for evaluation and comparison.

The approach gives researchers a way to study the effects of disclosure avoidance, including the differential privacy methods introduced in the 2020 Decennial Census, without access to confidential records. While the underlying problem is provably hard, the authors report that their methods work well in practice and that the generated data are close to the desired ground truth.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections

Mark Bun, Marco Gaboardi, Marcel Neunhoeffer, Wanrong Zhang

Motivated by privacy concerns in long-term longitudinal studies in medical and social science research, we study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a model where, in every time step, each individual reports a new data element, and the goal of the synthesizer is to incrementally update a synthetic dataset in a consistent way to capture a rich class of statistical properties. We give continual synthetic data generation algorithms that preserve two basic types of queries: fixed time window queries and cumulative time queries. We show nearly tight upper bounds on the error rates of these algorithms and demonstrate their empirical performance on realistically sized datasets from the U.S. Census Bureau's Survey of Income and Program Participation.

5/28/2024

cs.DS cs.CR cs.CY

Differentially Private Synthetic Data with Private Density Estimation

Nikolija Bojkovic, Po-Ling Loh

The need to analyze sensitive data, such as medical records or financial data, has created a critical research challenge in recent years. In this paper, we adopt the framework of differential privacy, and explore mechanisms for generating an entire dataset which accurately captures characteristics of the original data. We build upon the work of Boedihardjo et al, which laid the foundations for a new optimization-based algorithm for generating private synthetic data. Importantly, we adapt their algorithm by replacing a uniform sampling step with a private distribution estimator; this allows us to obtain better computational guarantees for discrete distributions, and develop a novel algorithm suitable for continuous distributions. We also explore applications of our work to several statistical tasks.

5/9/2024

cs.CR cs.IT cs.LG stat.ML

Minus-One Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity

William H. Press

We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that learns a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.

6/11/2024

cs.CR cs.CY

Differentially Private Verification of Survey-Weighted Estimates

Tong Lin, Jerome P. Reiter

Several official statistics agencies release synthetic data as public use microdata files. In practice, synthetic data do not admit accurate results for every analysis. Thus, it is beneficial for agencies to provide users with feedback on the quality of their analyses of the synthetic data. One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data. However, such measures leak information about the confidential records, so that agencies may wish to apply disclosure control methods to the released verification measures. We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design. We illustrate the verification measure using repeated sampling simulations where the confidential data are sampled with a probability proportional to size design, and the analyst estimates a population total or mean with the synthetic data. The simulations suggest that the verification measures can provide useful information about the quality of synthetic data inferences.

4/4/2024

cs.CR