The problem of hierarchically clustering items from pairwise similarities arises across scientific disciplines, from biology to networking. In practice, applications of clustering techniques are often limited by the cost of obtaining similarities between pairs of items. While prior work has shown that a clustering can be reconstructed from a significantly reduced set of pairwise similarities via adaptive measurements, those techniques apply only when the user can choose which similarities to measure. In this paper, we examine reconstructing a hierarchical clustering from similarities observed at random. We derive precise bounds showing that a significant fraction of the hierarchical clustering can be recovered without observing all the pairwise similarities. In particular, the correct hierarchical clustering down to clusters comprising a constant fraction of the total number of items (i.e., clusters of size O(N)) can be found using only O(N log N) randomly selected pairwise similarities in expectation.
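A toy sketch (not the paper's algorithm, and with hypothetical names throughout) of the intuition behind the O(N log N) figure: within a cluster of size O(N), roughly c·N·log N uniformly random pair queries form a connected random graph with high probability, so simply linking high-similarity pairs with union-find already reveals the top-level split.

```python
import numpy as np

# Illustration only: two planted clusters with a noiseless oracle that
# returns similarity 1 for same-cluster pairs and 0 otherwise.
rng = np.random.default_rng(3)
N = 600
labels_true = rng.integers(0, 2, size=N)  # two planted clusters

def similarity(i, j):
    # Stand-in for one (possibly expensive) pairwise measurement.
    return 1.0 if labels_true[i] == labels_true[j] else 0.0

# Union-find with path halving.
parent = list(range(N))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# O(N log N) randomly selected pair queries; link high-similarity pairs.
m = int(3 * N * np.log(N))
for _ in range(m):
    i, j = rng.integers(0, N, size=2)
    if similarity(i, j) > 0.5:
        parent[find(i)] = find(j)

# With high probability, exactly the two planted clusters survive.
components = {find(i) for i in range(N)}
```

The connectivity threshold for a random graph on n nodes is about (n log n)/2 edges, which is where the log factor in the sample bound comes from.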
In this article, we study the notion of similarity within the context of cluster analysis. We begin by surveying distances commonly used for this task and highlight important properties they may have, such as taking the data distribution into account or being less sensitive to the curse of dimensionality. We then study inter- and intra-cluster similarities and identify how these choices influence the nature of the resulting clusters.
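As a concrete illustration of a distance that uses the data distribution, the sketch below (toy data, names of my own choosing) contrasts the Euclidean distance with the Mahalanobis distance on strongly correlated data: two points equally far in Euclidean terms can be very differently far once the covariance structure is taken into account.

```python
import numpy as np

# Toy data: two features that are strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=500)])

a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])   # lies along the data's main axis
c = np.array([1.0, -2.0])  # lies across it

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def mahalanobis(u, v, cov_inv):
    # Distance in the metric induced by the inverse covariance.
    d = u - v
    return float(np.sqrt(d @ cov_inv @ d))

cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

# Euclidean treats b and c as equally far from a, while Mahalanobis,
# which uses the data distribution, treats the off-axis point c as
# far more distant.
print(euclidean(a, b), euclidean(a, c))
print(mahalanobis(a, b, cov_inv), mahalanobis(a, c, cov_inv))
```

Which behaviour is desirable depends on the clustering task: Mahalanobis-style distances adapt to the data's shape, at the cost of estimating (and inverting) a covariance matrix.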
Advances in sensing technologies and the growth of the internet have produced an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient both in the number of measurements needed and in running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity, and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, the algorithm recovers all clusters of size Ω(log n) using O(n log^2 n) similarities and runs in O(n log^3 n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that the framework is practically appealing.
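A minimal sketch of the subset-then-assign idea, not the paper's actual algorithm: spectrally bisect a small random subset of the objects, then place every remaining object using only its similarities to that subset, for O(n·s) measurements instead of O(n^2). All names and the Gaussian-similarity choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 400, 40  # n objects, random subset of size s

# Hypothetical data: two well-separated Gaussian blobs on a line.
points = np.concatenate([rng.normal(-3, 0.5, n // 2),
                         rng.normal(3, 0.5, n // 2)])
labels_true = np.array([0] * (n // 2) + [1] * (n // 2))

def sim(i, j):
    # Each call stands in for one (possibly expensive) measurement.
    return np.exp(-(points[i] - points[j]) ** 2)

subset = rng.choice(n, size=s, replace=False)

# Spectral bisection of the subset: split by the sign of the
# second-smallest eigenvector (Fiedler vector) of the graph Laplacian.
W = np.array([[sim(i, j) for j in subset] for i in subset])
L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
side = (eigvecs[:, 1] > 0).astype(int)

# Assign every object to the side it is more similar to on average,
# measuring only its s similarities to the subset.
pred = np.empty(n, dtype=int)
for i in range(n):
    s_to = np.array([sim(i, j) for j in subset])
    pred[i] = int(s_to[side == 1].mean() > s_to[side == 0].mean())

# Agreement with the planted clusters (up to label swap).
acc = max(np.mean(pred == labels_true), np.mean(pred != labels_true))
```

Recursing this bisection on each recovered side is what turns the sketch into a hierarchical procedure; the framework in the abstract additionally controls subset sizes to obtain its stated guarantees.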
The goal of the neighborhood explorer tool was to create a way for people to explore similarities between different areas in DC. The resulting Tableau dashboard depicts a subset of DC's census blocks, grouped into colors based on their similarity in terms of points of interest. The points of interest dataset from Open Data DC includes schools, historical sites, museums, monuments, galleries, fire and police stations, libraries, public services, universities, hospitals, rec centers, etc. DC is chock full of fascinating places to visit and things to do, providing abundant inspiration to build the tool. I ended up with 10 clusters, which I assigned to colors based on my perception of their persona. Clustering is an art, not a science, so this process could have gone any number of ways.
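For readers curious what the clustering step looks like in code, here is a small sketch with made-up data standing in for the Open Data DC points-of-interest counts (the real tool used many more categories and blocks): each census block becomes a vector of POI counts per category, and k-means groups similar blocks.

```python
import numpy as np

# Hypothetical stand-in for the POI data: rows are census blocks,
# columns are counts per POI category (e.g. schools, museums, parks).
rng = np.random.default_rng(7)
blocks = np.vstack([rng.poisson([8, 1, 1], size=(30, 3)),   # school-heavy
                    rng.poisson([1, 8, 1], size=(30, 3)),   # museum-heavy
                    rng.poisson([1, 1, 8], size=(30, 3))])  # park-heavy

def kmeans(X, init_idx, iters=20):
    """Plain Lloyd's algorithm: assign each row to its nearest
    centroid, then move each centroid to its cluster's mean."""
    centroids = X[init_idx].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(len(init_idx)):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels

# Seeded with one block from each planted group, for reproducibility.
labels = kmeans(blocks, init_idx=[0, 30, 60])
```

The "art, not science" part of the post shows up here too: the number of clusters, the POI categories used, and the initialization all change which neighborhood personas emerge.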
This one is for all the football-crazy data scientists! Watching Robert Lewandowski banging in goal after goal every week has seemed very normal for a good number of years now. His career has been filled with goals at every stage, culminating last season with the magnificent treble. Though he shows no signs of slowing down, we must acknowledge that he is now 32, well beyond a striker's prime, and likely to start declining in the near future. When that time comes, Bayern will find it quite difficult to find an apt replacement, and rightly so: players like him are a rarity.