Private Geometric Median

2406.07407

Published 6/12/2024 by Mahdi Haghifam, Thomas Steinke, Jonathan Ullman

📊

Abstract

In this paper, we study differentially private (DP) algorithms for computing the geometric median (GM) of a dataset: Given $n$ points, $x_1,dots,x_n$ in $mathbb{R}^d$, the goal is to find a point $theta$ that minimizes the sum of the Euclidean distances to these points, i.e., $sum_{i=1}^{n} |theta - x_i|_2$. Off-the-shelf methods, such as DP-GD, require strong a priori knowledge locating the data within a ball of radius $R$, and the excess risk of the algorithm depends linearly on $R$. In this paper, we ask: can we design an efficient and private algorithm with an excess error guarantee that scales with the (unknown) radius containing the majority of the datapoints? Our main contribution is a pair of polynomial-time DP algorithms for the task of private GM with an excess error guarantee that scales with the effective diameter of the datapoints. Additionally, we propose an inefficient algorithm based on the inverse smooth sensitivity mechanism, which satisfies the more restrictive notion of pure DP. We complement our results with a lower bound and demonstrate the optimality of our polynomial-time algorithms in terms of sample complexity.

Create account to get full access

Overview

This paper investigates differentially private (DP) algorithms for computing the geometric median (GM) of a dataset.
The geometric median is a central point that minimizes the sum of Euclidean distances to all the data points.
Existing DP-GD methods require strong prior knowledge about the data's location, and their excess error scales linearly with the data's radius.
The paper aims to design efficient and private algorithms with excess error scaling with the effective diameter of the data, which is unknown.

Plain English Explanation

The geometric median is like finding the central point in a group of locations that has the shortest total distance to all the other locations. Imagine a bunch of cities on a map, and you want to find the city that's closest to all the other cities on average.

Existing private algorithms for computing the geometric median require you to know ahead of time how far the data points are spread out. This means you need to know the maximum distance between any data point and the central point. The performance of these algorithms gets worse as this maximum distance increases.

This paper tries to develop new private algorithms that can compute the geometric median without needing to know how spread out the data is ahead of time. The key idea is to design algorithms whose error scales with the actual spread of the data, rather than the maximum possible spread. This can lead to much better performance when the data is concentrated in a smaller region.

The paper presents a few new private algorithms for computing the geometric median, and also shows that these algorithms are optimal in terms of the number of data points needed to achieve a certain level of accuracy.

Technical Explanation

The paper studies the problem of privately estimating the geometric median of a dataset. Given $n$ points $x_1, \dots, x_n$ in $\mathbb{R}^d$, the goal is to find a point $\theta$ that minimizes the sum of the Euclidean distances to these points, i.e., $\sum_{i=1}^{n} |\theta - x_i|_2$.

The authors first observe that off-the-shelf methods, such as DP-GD, require strong a priori knowledge about the location of the data within a ball of radius $R$. The excess risk of these algorithms depends linearly on $R$, which can be problematic if the true radius containing the majority of the data points is much smaller.

To address this limitation, the main contributions of the paper are:

A pair of polynomial-time DP algorithms for private geometric median estimation, with excess error guarantees that scale with the effective diameter of the datapoints, rather than the maximum radius.
An inefficient algorithm based on the inverse smooth sensitivity mechanism, which satisfies the more restrictive notion of pure DP.
A lower bound that demonstrates the optimality of the polynomial-time algorithms in terms of sample complexity.

The key ideas behind the polynomial-time algorithms are to privately estimate the center of mass of the dataset, and then use this estimate as a starting point for a private gradient descent procedure.

Critical Analysis

The paper makes an important contribution by addressing the limitations of existing DP algorithms for geometric median estimation, which require strong prior knowledge about the data distribution. The proposed algorithms are able to achieve excess error guarantees that scale with the effective diameter of the data, rather than the maximum possible radius.

One potential limitation is that the polynomial-time algorithms still require some prior knowledge about the data, specifically a bound on the effective diameter. While this is a weaker assumption than knowing the maximum radius, it may still be challenging to obtain in practice. The authors do present an inefficient algorithm that satisfies the more restrictive notion of pure DP, but its practical applicability may be limited.

Additionally, the paper focuses on the geometric median as the target parameter, but it would be interesting to see if similar techniques can be applied to other centrality measures, such as the trimmed mean or the spatial median.

Overall, this paper makes a valuable contribution to the field of private data analysis, and the proposed algorithms could have important applications in areas where preserving user privacy is crucial.

Conclusion

This paper introduces new differentially private algorithms for computing the geometric median of a dataset, which is a central point that minimizes the sum of distances to all the data points. The key innovation is that the excess error of the proposed algorithms scales with the effective diameter of the data, rather than the maximum possible radius, which can lead to significantly better performance when the data is concentrated in a smaller region.

The paper presents a technical analysis of the proposed algorithms, including their optimality in terms of sample complexity, as well as a critical discussion of their limitations and potential extensions. Overall, this work advances the state of the art in private data analysis and could have important implications for applications where preserving user privacy is a key concern.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Differentially private projection-depth-based medians

Kelly Ramsay, Dylan Spicker

We develop $(epsilon,delta)$-differentially private projection-depth-based medians using the propose-test-release (PTR) and exponential mechanisms. Under general conditions on the input parameters and the population measure, (e.g. we do not assume any moment bounds), we quantify the probability the test in PTR fails, as well as the cost of privacy via finite sample deviation bounds. We then present a new definition of the finite sample breakdown point which applies to a mechanism, and present a lower bound on the finite sample breakdown point of the projection-depth-based median. We demonstrate our main results on the canonical projection-depth-based median, as well as on projection-depth-based medians derived from trimmed estimators. In the Gaussian setting, we show that the resulting deviation bound matches the known lower bound for private Gaussian mean estimation. In the Cauchy setting, we show that the outlier error amplification effect resulting from the heavy tails outweighs the cost of privacy. This result is then verified via numerical simulations. Additionally, we present results on general PTR mechanisms and a uniform concentration result on the projected spacings of order statistics, which may be of general interest.

5/20/2024

cs.CR cs.LG

🚀

Private Mean Estimation with Person-Level Differential Privacy

Sushant Agarwal, Gautam Kamath, Mahbod Majid, Argyris Mouzakis, Rose Silver, Jonathan Ullman

We study differentially private (DP) mean estimation in the case where each person holds multiple samples. Commonly referred to as the user-level setting, DP here requires the usual notion of distributional stability when all of a person's datapoints can be modified. Informally, if $n$ people each have $m$ samples from an unknown $d$-dimensional distribution with bounded $k$-th moments, we show that [n = tilde Thetaleft(frac{d}{alpha^2 m} + frac{d }{ alpha m^{1/2} varepsilon} + frac{d}{alpha^{k/(k-1)} m varepsilon} + frac{d}{varepsilon}right)] people are necessary and sufficient to estimate the mean up to distance $alpha$ in $ell_2$-norm under $varepsilon$-differential privacy (and its common relaxations). In the multivariate setting, we give computationally efficient algorithms under approximate DP (with slightly degraded sample complexity) and computationally inefficient algorithms under pure DP, and our nearly matching lower bounds hold for the most permissive case of approximate DP. Our computationally efficient estimators are based on the well known noisy-clipped-mean approach, but the analysis for our setting requires new bounds on the tails of sums of independent, vector-valued, bounded-moments random variables, and a new argument for bounding the bias introduced by clipping.

6/3/2024

cs.DS cs.CR cs.IT cs.LG stat.ML

🏅

Private Edge Density Estimation for Random Graphs: Optimal, Efficient and Robust

Hongjie Chen, Jingqiu Ding, Yiding Hua, David Steurer

We give the first polynomial-time, differentially node-private, and robust algorithm for estimating the edge density of ErdH{o}s-R'enyi random graphs and their generalization, inhomogeneous random graphs. We further prove information-theoretical lower bounds, showing that the error rate of our algorithm is optimal up to logarithmic factors. Previous algorithms incur either exponential running time or suboptimal error rates. Two key ingredients of our algorithm are (1) a new sum-of-squares algorithm for robust edge density estimation, and (2) the reduction from privacy to robustness based on sum-of-squares exponential mechanisms due to Hopkins et al. (STOC 2023).

6/5/2024

cs.DS cs.LG stat.ML

📉

Mixtures of Gaussians are Privately Learnable with a Polynomial Number of Samples

Mohammad Afzali, Hassan Ashtiani, Christopher Liaw

We study the problem of estimating mixtures of Gaussians under the constraint of differential privacy (DP). Our main result is that $text{poly}(k,d,1/alpha,1/varepsilon,log(1/delta))$ samples are sufficient to estimate a mixture of $k$ Gaussians in $mathbb{R}^d$ up to total variation distance $alpha$ while satisfying $(varepsilon, delta)$-DP. This is the first finite sample complexity upper bound for the problem that does not make any structural assumptions on the GMMs. To solve the problem, we devise a new framework which may be useful for other tasks. On a high level, we show that if a class of distributions (such as Gaussians) is (1) list decodable and (2) admits a locally small'' cover (Bun et al., 2021) with respect to total variation distance, then the class of its mixtures is privately learnable. The proof circumvents a known barrier indicating that, unlike Gaussians, GMMs do not admit a locally small cover (Aden-Ali et al., 2021b).

4/24/2024

stat.ML cs.CR cs.DS cs.IT cs.LG