
Collaborating Authors

 vassilvitskii


Coreset for Line-Sets Clustering

Neural Information Processing Systems

A natural generalization is to replace this input set P of n points by a set 𝒫 of n sets in X. The distance from such an input set P ∈ 𝒫 to a set C of centers can then be defined as the distance between the closest point-center pair. This problem is called k-mean for sets; see e.g.
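The set-to-centers distance described in this excerpt can be sketched in a few lines. This is only an illustration under assumptions not stated in the excerpt: points are coordinate tuples, the underlying metric is Euclidean, and the clustering cost sums squared distances as in ordinary k-means. The function names `set_to_centers_dist` and `sets_cost` are hypothetical.

```python
import math

def set_to_centers_dist(P, C):
    """Distance from one input set P to the center set C:
    the distance of the closest (point, center) pair."""
    return min(math.dist(p, c) for p in P for c in C)

def sets_cost(sets, C):
    """Assumed k-means-style cost: sum of squared set-to-centers distances."""
    return sum(set_to_centers_dist(P, C) ** 2 for P in sets)

# One input set with two points, and two candidate centers.
P = [(0, 0), (10, 0)]
C = [(3, 4), (10, 1)]
print(set_to_centers_dist(P, C))  # closest pair is (10, 0) and (10, 1)
```

Only the single closest pair determines the distance, so a set is "served" as soon as any one of its points lies near any center.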


Improved Guarantees for k-means++ and k-means++ Parallel

Neural Information Processing Systems

Lloyd's algorithm uses iterative improvements to find a locally optimal k-means clustering. The performance of Lloyd's algorithm crucially depends on the quality of the initial clustering, which is defined by the initial set of centers, called a seed.
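The dependence on the seed can be seen in a minimal sketch of Lloyd's iterations. This is a textbook-style version, not the implementation from the paper; it assumes Euclidean points given as tuples, and the names `lloyd` and `kmeans_cost` are hypothetical.

```python
import math

def lloyd(points, seed, max_iter=100):
    """Plain Lloyd iterations starting from the given seed centers."""
    centers = [tuple(c) for c in seed]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for c, cl in zip(centers, clusters)
        ]
        if new_centers == centers:  # no change: a local optimum was reached
            break
        centers = new_centers
    return centers

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Four corners of a 4-by-1 rectangle, clustered with k = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = lloyd(points, seed=[(0, 0), (4, 0)])  # one seed per short side
bad = lloyd(points, seed=[(0, 0), (0, 1)])   # both seeds on the same side
print(kmeans_cost(points, good), kmeans_cost(points, bad))
```

On this input both runs converge, but the second seed gets stuck in a local optimum whose cost is sixteen times larger, which is the kind of gap that careful seeding is meant to avoid.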








Algorithms for Caching and MTS with reduced number of predictions

Sadek, Karim Abdel, Elias, Marek

arXiv.org Artificial Intelligence

ML-augmented algorithms utilize predictions to achieve performance beyond their worst-case bounds. Producing these predictions might be a costly operation - this motivated Im et al. (2022) to introduce the study of algorithms which use predictions parsimoniously. We design parsimonious algorithms for caching and MTS with action predictions, proposed by Antoniadis et al. (2023), focusing on the parameters of consistency (performance with perfect predictions) and smoothness (dependence of their performance on the prediction error). Our algorithm for caching is 1-consistent, robust, and its smoothness deteriorates with the decreasing number of available predictions. We propose an algorithm for general MTS whose consistency and smoothness both scale linearly with the decreasing number of predictions. Without the restriction on the number of available predictions, both algorithms match the earlier guarantees achieved by Antoniadis et al. (2023).

Caching, introduced by Sleator and Tarjan (1985), is a fundamental problem in online computation important both in theory and practice. Here, we have a fast memory (cache) which can contain up to k different pages and we receive a sequence of requests to pages in an online manner. Whenever a page is requested, it needs to be loaded in the cache. Therefore, if the requested page is already in the cache, it can be accessed at no cost. Otherwise, we suffer a page fault: we have to evict one page from the cache and load the requested page in its place. The page to evict is to be chosen without knowledge of the future requests and our target is to minimize the total number of page faults.

Caching is a special case of Metrical Task Systems introduced by Borodin et al. (1992) as a generalization of many fundamental online problems. In the beginning, we are given a metric space M of states which can be interpreted as actions or configurations of some system.
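The caching model described above can be simulated directly. The sketch below is an illustration only and is not the parsimonious algorithm of the paper: it counts page faults for LRU, a standard online eviction rule chosen here as an example, and for the offline farthest-in-future rule, which sees the whole request sequence. The function names are hypothetical.

```python
from collections import OrderedDict

def count_faults_lru(requests, k):
    """Online policy: on a fault with a full cache, evict the least recently used page."""
    cache = OrderedDict()  # keys are pages, ordered from least to most recently used
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # hit: no cost, refresh recency
            continue
        faults += 1                  # page fault: the page must be loaded
        if len(cache) >= k:
            cache.popitem(last=False)  # evict the least recently used page
        cache[page] = True
    return faults

def count_faults_offline(requests, k):
    """Offline baseline: evict the page whose next request is farthest in the future."""
    cache, faults = set(), 0
    for t, page in enumerate(requests):
        if page in cache:
            continue
        faults += 1
        if len(cache) >= k:
            def next_use(q):
                # Index of the next request to q, or infinity if q is never requested again.
                for s in range(t + 1, len(requests)):
                    if requests[s] == q:
                        return s
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return faults

requests = ["a", "b", "c", "a", "d", "b"]
print(count_faults_lru(requests, k=3), count_faults_offline(requests, k=3))
```

On this short sequence LRU evicts a page that is requested again immediately afterwards, while the offline rule does not. Predictions in the learning-augmented setting are meant to narrow this kind of gap between online and offline decisions.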
A recently emerging field of learning-augmented algorithms, introduced in seminal papers by Kraska et al. (2018) and Lykouris and Vassilvitskii (2021), investigates approaches to improve the performance of algorithms using predictions, possibly generated by some ML model.


Fast Distributed k-Center Clustering with Outliers on Massive Data

Neural Information Processing Systems

Clustering large data is a fundamental problem with a vast number of applications. Due to the increasing size of data, practitioners interested in clustering have turned to distributed computation methods. In this work, we consider the widely used k-center clustering problem and its variant used to handle noisy data, k-center with outliers. In the noise-free setting we demonstrate how a previously-proposed distributed method is actually an O(1)-approximation algorithm, which accurately explains its strong empirical performance. Additionally, in the noisy setting, we develop a novel distributed algorithm that is also an O(1)-approximation. These algorithms are highly parallel and lend themselves to virtually any distributed computing framework. We compare each empirically against the best known sequential clustering methods and show that both distributed algorithms are consistently close to their sequential versions. The algorithms are all one can hope for in distributed settings: they are fast, memory efficient and they match their sequential counterparts.
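For orientation, the k-center objective can be illustrated with the classic sequential greedy (farthest-first) heuristic, which is known to give a 2-approximation in the noise-free setting. This is not the distributed algorithm analyzed in the paper and it does not handle outliers; it assumes Euclidean points given as tuples, and the name `greedy_k_center` is hypothetical.

```python
import math

def greedy_k_center(points, k):
    """Farthest-first traversal: repeatedly add the point farthest from the chosen centers.
    Returns the centers and the resulting radius (largest point-to-nearest-center distance)."""
    centers = [points[0]]  # arbitrary first center
    # d[j] is the distance from points[j] to its nearest chosen center so far.
    d = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=d.__getitem__)  # farthest point
        centers.append(points[i])
        d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
    return centers, max(d)

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
centers, radius = greedy_k_center(points, k=2)
print(centers, radius)
```

Because the objective is the maximum distance rather than a sum, a single far-away noisy point can dominate the radius, which is why the outlier variant mentioned in the abstract is treated separately.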