Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means
la Tour, Max Dupré, Saulpic, David
The k-means objective function was introduced by Lloyd in 1957 (and published later in [Llo82]) as a measure of compression quality. Given a set of points P and an integer k, minimizing the k-means objective yields a set of k centers that provides a good compressed representation of the original dataset P. Lloyd's original motivation was to compress analog audio signals into digital ones: a digital signal must be discrete, and Lloyd proposed a method to determine which frequencies should be kept in the discretization. His method was a heuristic for minimizing what he called the quantization error: the sum, over all points, of the squared distance from each point to its representative. This is precisely the k-means cost, and the goal of the k-means problem is to find the set of k representatives (or centers) that minimizes this cost. In contrast, the k-median cost function is the sum, for each point, of the distance to its closest center, which inherently gives less weight to outliers in the dataset.
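As a minimal illustrative sketch (not code from the paper, with hypothetical helper names), the two objectives can be computed directly from their definitions; the example also shows how squaring the distances makes the k-means cost far more sensitive to an outlier than the k-median cost:

```python
import math

def kmeans_cost(points, centers):
    """k-means cost: sum over points of the SQUARED distance to the closest center."""
    return sum(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
               for px, py in points)

def kmedian_cost(points, centers):
    """k-median cost: sum over points of the (unsquared) distance to the closest center."""
    return sum(min(math.hypot(px - cx, py - cy) for cx, cy in centers)
               for px, py in points)

# The squared objective is dominated by the outlier at (100, 0).
points = [(0, 0), (1, 0), (2, 0), (100, 0)]
centers = [(1, 0)]
print(kmeans_cost(points, centers))   # 1 + 0 + 1 + 99^2 = 9803
print(kmedian_cost(points, centers))  # 1 + 0 + 1 + 99   = 101.0
```

The single outlier contributes over 99% of the k-means cost but under 99% of the k-median cost, which is why k-median is often preferred for outlier-heavy data.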
Making Old Things New: A Unified Algorithm for Differentially Private Clustering
la Tour, Max Dupré, Henzinger, Monika, Saulpic, David
As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy was the first of them, and the problem has also been studied under the local and shuffle variants. In each case, the goal is to design an algorithm that privately computes a clustering with the smallest possible error. Each variant gave rise to new algorithms, so the landscape of private clustering algorithms is quite intricate. In this paper, we show that a 20-year-old algorithm can be slightly modified to work in any of these models. This provides a unified picture: while matching almost all previously known results, it allows us to improve some of them and to extend the approach to a new privacy model, the continual observation setting, where the input changes over time and the algorithm must output a new solution at each time step.
Gerrymandering Planar Graphs
Dippel, Jack, la Tour, Max Dupré, Niu, April, Roy, Sanjukta, Vetta, Adrian
We study the computational complexity of the map redistricting problem (gerrymandering). Mathematically, the electoral district designer (gerrymanderer) attempts to partition a weighted graph into $k$ connected components (districts) such that its candidate (party) wins as many districts as possible. Prior work has principally concerned the special cases where the graph is a path or a tree. Our focus is the realistic case where the graph is planar. We prove that the gerrymandering problem is solvable in polynomial time in $\lambda$-outerplanar graphs when the number of candidates and $\lambda$ are constants and the vertex weights (voting weights) are polynomially bounded. In contrast, the problem is NP-complete in general planar graphs even with just two candidates. This motivates the study of approximation algorithms for gerrymandering planar graphs. However, when the number of candidates is large, we prove it is hard to distinguish between instances where the gerrymanderer cannot win a single district and instances where the gerrymanderer can win at least one district. This immediately implies that the redistricting problem is inapproximable in polynomial time in planar graphs, unless P=NP. This conclusion appears terminal for the design of good approximation algorithms -- but it is not. The inapproximability bound can be circumvented, as it only applies when the maximum number of districts the gerrymanderer can win is extremely small, say one. Indeed, for a fixed number of candidates, our main result is that there is a constant factor approximation algorithm for redistricting unweighted planar graphs, provided the optimal value is a large enough constant.
Differential Privacy for Clustering Under Continual Observation
la Tour, Max Dupré, Henzinger, Monika, Saulpic, David
We consider the problem of privately clustering a dataset in $\mathbb{R}^d$ that undergoes both insertions and deletions of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for this problem whose additive error depends only logarithmically on the number $T$ of updates; the multiplicative error is almost the same as in the non-private setting. To achieve this, we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for $k$-means. We also partially extend our results to the $k$-median problem.
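As a generic, hypothetical sketch (not the paper's mechanism), the following shows the standard route by which additive error enters a private clustering statistic: releasing an $\varepsilon$-DP estimate of a cluster's mean by adding Laplace noise to its sum and its count, with the budget split between the two releases. All names here are illustrative, and points are assumed to lie in a ball of known radius:

```python
import math
import random

def laplace(scale):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_cluster_mean(points, eps, radius):
    """eps-DP estimate of the mean of a 1-d cluster, assuming |x| <= radius.
    The budget eps is split: eps/2 for the noisy sum (sensitivity radius)
    and eps/2 for the noisy count (sensitivity 1)."""
    noisy_sum = sum(points) + laplace(radius / (eps / 2))
    noisy_count = len(points) + laplace(1.0 / (eps / 2))
    return noisy_sum / max(1.0, noisy_count)

random.seed(0)
pts = [0.2, 0.4, 0.6]
print(private_cluster_mean(pts, eps=100.0, radius=1.0))  # close to the true mean 0.4
```

The added noise is the source of the additive error in private clustering, and its scale grows as $\varepsilon$ shrinks; under continual observation, the challenge addressed by the paper is keeping this additive term logarithmic in the number of updates rather than paying fresh noise at every time step.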