Goto

Collaborating Authors

 Aditya Bhaskara




Greedy Sampling for Approximate Clustering in the Presence of Outliers

Neural Information Processing Systems

Greedy algorithms such as adaptive sampling (k-means++) and furthest point traversal are popular choices for clustering problems. One the one hand, they possess good theoretical approximation guarantees, and on the other, they are fast and easy to implement. However, one main issue with these algorithms is the sensitivity to noise/outliers in the data. In this work we show that for k-means and k-center clustering, simple modifications to the well-studied greedy algorithms result in nearly identical guarantees, while additionally being robust to outliers. For instance, in the case of k-means++, we show that a simple thresholding operation on the distances suffices to obtain an O(log k) approximation to the objective. We obtain similar results for the simpler k-center problem. Finally, we show experimentally that our algorithms are easy to implement and scale well. We also measure their ability to identify noisy points added to a dataset.



Linear Relaxations for Finding Diverse Elements in Metric Spaces

Neural Information Processing Systems

Choosing a diverse subset of a large collection of points in a metric space is a fundamental problem, with applications in feature selection, recommender systems, web search, data summarization, etc. Various notions of diversity have been proposed, tailored to different applications. The general algorithmic goal is to find a subset of points that maximize diversity, while obeying a cardinality (or more generally, matroid) constraint. The goal of this paper is to develop a novel linear programming (LP) framework that allows us to design approximation algorithms for such problems. We study an objective known as sum-min diversity, which is known to be effective in many applications, and give the first constant factor approximation algorithm. Our LP framework allows us to easily incorporate additional constraints, as well as secondary objectives. We also prove a hardness result for two natural diversity objectives, under the so-called planted clique assumption. Finally, we study the empirical performance of our algorithm on several standard datasets.