k-means
Distributionally Robust K-Means Clustering
Malik, Vikrant, Kargin, Taylan, Hassibi, Babak
In recent years, the widespreadavailability of large-scale, high-dimensionaldatasets has driven significant interest in clustering algorithms that are both computationally efficient and robust to distributional shifts and outliers. The classical clustering method, K-means, can be seen as an application of the Lloyd-Max quantization algorithm, in which the distribution being quantized is the empirical distribution of the points to be clustered. This empirical distribution generally differs from the true underlying distribution, especially when the number of points to be clustered is small. This induces a distributional shift, which can also arise in many real-world settings, such as image segmentation, biological data analysis, and sensor networks, due to noise variations, sensor inaccuracies, or environmental changes. Distributional shifts can severely impact the performance of clustering algorithms, leading to degraded cluster assignments and unreliable downstream analysis. The field of clustering has a rich history. One of the most popular algorithms in this field is theK-means (KM) algorithm, introduced by [1], which computes centroids by iteratively updating the conditional mean of the data in the Voronoi regions induced by the centroids. However, standardK-means is sensitive to initialization and, in general, converges only to a local minimum.
- North America > United States > California > Los Angeles County > Pasadena (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Alameda County > Oakland (0.04)
- (2 more...)
On the Optimal Number of Grids for Differentially Private Non-Interactive $K$-Means Clustering
Muthukrishnan, Gokularam, Tandon, Anshoo
Differentially private $K$-means clustering enables releasing cluster centers derived from a dataset while protecting the privacy of the individuals. Non-interactive clustering techniques based on privatized histograms are attractive because the released data synopsis can be reused for other downstream tasks without additional privacy loss. The choice of the number of grids for discretizing the data points is crucial, as it directly controls the quantization bias and the amount of noise injected to preserve privacy. The widely adopted strategy selects a grid size that is independent of the number of clusters and also relies on empirical tuning. In this work, we revisit this choice and propose a refined grid-size selection rule derived by minimizing an upper bound on the expected deviation in the K-means objective function, leading to a more principled discretization strategy for non-interactive private clustering. Compared to prior work, our grid resolution differs both in its dependence on the number of clusters and in the scaling with dataset size and privacy budget. Extensive numerical results elucidate that the proposed strategy results in accurate clustering compared to the state-of-the-art techniques, even under tight privacy budgets.
- North America > United States (0.14)
- Asia > India > Karnataka > Bengaluru (0.04)
Distributed Gradient Clustering: Convergence and the Effect of Initialization
Armacki, Aleksandar, Sharma, Himkant, Bajović, Dragana, Jakovetić, Dušan, Chakraborty, Mrityunjoy, Kar, Soummya
We study the effects of center initialization on the performance of a family of distributed gradient-based clustering algorithms introduced in [1], that work over connected networks of users. In the considered scenario, each user contains a local dataset and communicates only with its immediate neighbours, with the aim of finding a global clustering of the joint data. We perform extensive numerical experiments, evaluating the effects of center initialization on the performance of our family of methods, demonstrating that our methods are more resilient to the effects of initialization, compared to centralized gradient clustering [2]. Next, inspired by the $K$-means++ initialization [3], we propose a novel distributed center initialization scheme, which is shown to improve the performance of our methods, compared to the baseline random initialization.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- Europe > Serbia > Vojvodina > South Bačka District > Novi Sad (0.05)
- Asia > India > West Bengal > Kharagpur (0.04)
- (5 more...)
fbefa505c8e8bf6d46f38f5277fed8d6-AuthorFeedback.pdf
We would like to point out that K-means is used only once to17 initialize the representativesets and isnot anintrinsic component ofthe online algorithm. What is important to observe21 is that N is kept constant throughout in order to reduce the storage footprint and to ensure low-complexity online22 processing. Tomaintain the list constant, for every added point another point is removed. Also, the reviewer is correct24 in observing thatN does not feature in the convergence results, which are asymptotic and do not imply anything25 about the convergence rate. Clearly, if the point dimensionm is large, it is beneficial to increaseN.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Afghanistan > Parwan Province > Charikar (0.05)
- Europe > Italy > Lazio > Rome (0.04)
- (7 more...)
- North America > United States > Illinois (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > India > West Bengal > Kolkata (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)