k-means


Distributionally Robust K-Means Clustering

Malik, Vikrant, Kargin, Taylan, Hassibi, Babak

arXiv.org Machine Learning

In recent years, the widespread availability of large-scale, high-dimensional datasets has driven significant interest in clustering algorithms that are both computationally efficient and robust to distributional shifts and outliers. The classical clustering method, K-means, can be seen as an application of the Lloyd-Max quantization algorithm, in which the distribution being quantized is the empirical distribution of the points to be clustered. This empirical distribution generally differs from the true underlying distribution, especially when the number of points to be clustered is small. This induces a distributional shift, which can also arise in many real-world settings, such as image segmentation, biological data analysis, and sensor networks, due to noise variations, sensor inaccuracies, or environmental changes. Distributional shifts can severely impact the performance of clustering algorithms, leading to degraded cluster assignments and unreliable downstream analysis. The field of clustering has a rich history. One of the most popular algorithms in this field is the K-means (KM) algorithm, introduced by [1], which computes centroids by iteratively updating the conditional mean of the data in the Voronoi regions induced by the centroids. However, standard K-means is sensitive to initialization and, in general, converges only to a local minimum.
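To make the Lloyd-Max connection concrete, here is a minimal sketch of the plain Lloyd iteration the abstract describes: alternate between assigning points to their nearest centroid (the Voronoi regions) and recomputing each centroid as the conditional mean of its region. This is the standard, non-robust baseline, not the paper's distributionally robust variant; the function name, random initialization, and stopping rule are illustrative.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration on the empirical distribution of the rows of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid,
        # i.e. the Voronoi region it falls in.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid <- conditional mean of its Voronoi region.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # fixed point reached: in general only a local minimum
        centroids = new_centroids
    return centroids, labels
```

As the abstract notes, the output depends on the random initialization and is only a local minimum in general, which is what motivates seeding schemes and robust reformulations.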


On the Optimal Number of Grids for Differentially Private Non-Interactive $K$-Means Clustering

Muthukrishnan, Gokularam, Tandon, Anshoo

arXiv.org Machine Learning

Differentially private $K$-means clustering enables releasing cluster centers derived from a dataset while protecting the privacy of the individuals. Non-interactive clustering techniques based on privatized histograms are attractive because the released data synopsis can be reused for other downstream tasks without additional privacy loss. The choice of the number of grids for discretizing the data points is crucial, as it directly controls the quantization bias and the amount of noise injected to preserve privacy. The widely adopted strategy selects a grid size that is independent of the number of clusters and relies on empirical tuning. In this work, we revisit this choice and propose a refined grid-size selection rule, derived by minimizing an upper bound on the expected deviation in the $K$-means objective function, leading to a more principled discretization strategy for non-interactive private clustering. Compared to prior work, our grid resolution differs both in its dependence on the number of clusters and in its scaling with the dataset size and privacy budget. Extensive numerical results show that the proposed strategy yields more accurate clustering than state-of-the-art techniques, even under tight privacy budgets.
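A minimal sketch of the non-interactive pipeline the abstract describes, assuming the Laplace mechanism under add/remove neighbouring datasets: the histogram is privatized once, and any post-processing of the released synopsis (here, weighted Lloyd iterations on the bin centers) incurs no additional privacy loss. The grid resolution `grid_per_dim` and the domain bounds `lo`, `hi` are illustrative placeholders, not the selection rule proposed in the paper.

```python
import numpy as np

def private_histogram_kmeans(X, k, grid_per_dim, epsilon, lo, hi, seed=0):
    """Release a Laplace-noised histogram once, then cluster the synopsis.
    Grid resolution and domain bounds [lo, hi] are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    edges = [np.linspace(lo, hi, grid_per_dim + 1) for _ in range(d)]
    counts, _ = np.histogramdd(X, bins=edges)
    # Adding or removing one individual changes a single bin count by 1, so
    # the histogram has L1 sensitivity 1 and Laplace(1/epsilon) noise gives
    # epsilon-DP; everything below is post-processing, costing no privacy.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    weights = np.clip(noisy, 0, None).ravel()
    mids = [(e[:-1] + e[1:]) / 2 for e in edges]
    bins = np.stack(np.meshgrid(*mids, indexing="ij"), axis=-1).reshape(-1, d)
    # Weighted Lloyd iterations on the released (bin center, noisy count) pairs.
    centroids = bins[rng.choice(len(bins), size=k, replace=False)]
    for _ in range(50):
        labels = np.linalg.norm(bins[:, None] - centroids[None], axis=2).argmin(1)
        for j in range(k):
            w = weights[labels == j]
            if w.sum() > 0:
                centroids[j] = (bins[labels == j] * w[:, None]).sum(0) / w.sum()
    return centroids
```

The trade-off the paper analyzes is visible here: a finer grid shrinks the quantization bias of replacing points by bin centers but spreads the Laplace noise over more bins.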


Distributed Gradient Clustering: Convergence and the Effect of Initialization

Armacki, Aleksandar, Sharma, Himkant, Bajović, Dragana, Jakovetić, Dušan, Chakraborty, Mrityunjoy, Kar, Soummya

arXiv.org Machine Learning

We study the effects of center initialization on the performance of a family of distributed gradient-based clustering algorithms introduced in [1], which operate over connected networks of users. In the considered scenario, each user holds a local dataset and communicates only with its immediate neighbours, with the aim of finding a global clustering of the joint data. We perform extensive numerical experiments evaluating the effects of center initialization on the performance of our family of methods, demonstrating that our methods are more resilient to initialization than centralized gradient clustering [2]. Next, inspired by the $K$-means++ initialization [3], we propose a novel distributed center initialization scheme, which is shown to improve the performance of our methods compared to baseline random initialization.
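For reference, a sketch of the centralized $K$-means++ seeding [3] that the proposed distributed scheme takes as its inspiration: the first center is drawn uniformly, and each subsequent center is sampled with probability proportional to the squared distance from the nearest already-chosen center. The paper's distributed variant must realize this sampling using only neighbour-to-neighbour communication, which this centralized snippet does not attempt.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Centralized K-means++ seeding: D^2-weighted sampling of centers."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :],
                           axis=2) ** 2,
            axis=1,
        )
        # Sample the next center with probability proportional to D^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```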


fbefa505c8e8bf6d46f38f5277fed8d6-AuthorFeedback.pdf

Neural Information Processing Systems

We would like to point out that K-means is used only once to initialize the representative sets and is not an intrinsic component of the online algorithm. What is important to observe is that N is kept constant throughout in order to reduce the storage footprint and to ensure low-complexity online processing. To keep the list size constant, for every added point another point is removed. Also, the reviewer is correct in observing that N does not feature in the convergence results, which are asymptotic and do not imply anything about the convergence rate. Clearly, if the point dimension m is large, it is beneficial to increase N.
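A tiny sketch of the constant-size representative list described in the response: every insertion beyond capacity evicts one stored point, so storage and per-step complexity stay bounded by N. The excerpt does not say which point is removed; oldest-first eviction below is purely illustrative.

```python
from collections import deque

class RepresentativeSet:
    """Fixed-capacity list of N representatives for online processing.
    The eviction rule (oldest-first) is an illustrative assumption."""

    def __init__(self, n):
        # A deque with maxlen=n silently drops the oldest entry on append,
        # keeping the list size constant once capacity is reached.
        self.points = deque(maxlen=n)

    def add(self, x):
        self.points.append(x)
```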