Goto

Collaborating Authors

 Clustering


Python: Implementing a k-means algorithm with sklearn

@machinelearnbot

The below is an example of how sklearn in Python can be used to develop a k-means clustering algorithm. The purpose of k-means clustering is to be able to partition observations in a dataset into a specific number of clusters in order to aid in analysis of the data. From this perspective, it has particular value from a data visualisation perspective. The particular example used here is that of stock returns. Specifically, the k-means scatter plot will illustrate the clustering of specific stock returns according to their dividend yield.


[Clustering] How and Where should you cut a dendrogram?

#artificialintelligence

Hierarchical clustering methods produce dendrograms which contain more information than mere flat clustering, for instance cluster proximity. A particular hierarchical clustering method, namely Single-Linkage, enjoys several nice theoretical properties (Zadeh and Ben-David, 2009) and (Carlsson and Mรฉmoli, 2010) despite being known to give poor results in practice. Nonetheless, in (Awasthi et al., 2012), authors show that if the center-based clustering instance verifies some perturbation resilient properties, finding the optimal center-based clustering is possible in polynomial time by cutting efficiently the Single-Linkage dendrogram while in the general case, finding the optimal clustering is NP-hard. The common practice to flatten dendrograms in $k$ clusters is to cut them off at constant height $k-1$. Yet it leads to poorer clusters than efficiently pruning the tree.


Clustering with Noisy Queries

arXiv.org Machine Learning

In this paper, we initiate a rigorous theoretical study of clustering with noisy queries (or a faulty oracle). Given a set of $n$ elements, our goal is to recover the true clustering by asking minimum number of pairwise queries to an oracle. Oracle can answer queries of the form : "do elements $u$ and $v$ belong to the same cluster?" -- the queries can be asked interactively (adaptive queries), or non-adaptively up-front, but its answer can be erroneous with probability $p$. In this paper, we provide the first information theoretic lower bound on the number of queries for clustering with noisy oracle in both situations. We design novel algorithms that closely match this query complexity lower bound, even when the number of clusters is unknown. Moreover, we design computationally efficient algorithms both for the adaptive and non-adaptive settings. The problem captures/generalizes multiple application scenarios. It is directly motivated by the growing body of work that use crowdsourcing for {\em entity resolution}, a fundamental and challenging data mining task aimed to identify all records in a database referring to the same entity. Here crowd represents the noisy oracle, and the number of queries directly relates to the cost of crowdsourcing. Another application comes from the problem of {\em sign edge prediction} in social network, where social interactions can be both positive and negative, and one must identify the sign of all pair-wise interactions by querying a few pairs. Furthermore, clustering with noisy oracle is intimately connected to correlation clustering, leading to improvement therein. Finally, it introduces a new direction of study in the popular {\em stochastic block model} where one has an incomplete stochastic block model matrix to recover the clusters.


On comparing clusterings: an element-centric framework unifies overlaps and hierarchy

arXiv.org Machine Learning

Clustering is one of the most universal approaches for understanding complex data. A pivotal aspect of clustering analysis is quantitatively comparing clusterings; clustering comparison is the basis for tasks such as clustering evaluation, consensus clustering, and tracking the temporal evolution of clusters. For example, the extrinsic evaluation of clustering methods requires comparing the uncovered clusterings to planted clusterings or known metadata. Yet, as we demonstrate, existing clustering comparison measures have critical biases which un- dermine their usefulness, and no measure accommodates both overlapping and hierarchical clusterings. Here we unify the comparison of disjoint, overlapping, and hierarchically struc- tured clusterings by proposing a new element-centric framework: elements are compared based on the relationships induced by the cluster structure, as opposed to the traditional cluster-centric philosophy. We demonstrate that, in contrast to standard clustering simi- larity measures, our framework does not suffer from critical biases and naturally provides unique insights into how the clusterings differ. We illustrate the strengths of our framework by revealing new insights into the organization of clusters in two applications: the improved classification of schizophrenia based on the overlapping and hierarchical community struc- ture of fMRI brain networks, and the disentanglement of various social homophily factors in Facebook social networks. The universality of clustering suggests far-reaching impact of our framework throughout all areas of science.


Adiabatic Quantum Computing for Binary Clustering

arXiv.org Machine Learning

Quantum computing promises fast solutions to a wide range of optimization problems and thus holds considerable potential for machine learning [1]-[3]. However, while the quantum machine learning literature so far mainly focused on the quantum gate paradigm, noticeable technological progress leading to commercial devices is happening in adiabatic quantum computing [4], [5]. Current adiabatic quantum computers are geared towards solving quadratic unconstrained binary optimization problems or Ising models. A simple strategy for setting up established learning algorithms to run on such devices is therefore to attempt to (re-)formulate or approximate their minimization or maximization objectives in terms of Ising models. In this paper, we apply this strategy to a simple unsupervised learning problem, namely binary clustering.


K-means Clustering with Tableau โ€“ Call Detail Records Example

@machinelearnbot

In this blog, we will discuss about clustering of customer activities for 24 hours by using K-means clustering feature in Tableau 10. This type of clustering helps you create statistically-based segments that provide insights about similarities in different groups and performance of the groups when compared to each other. You can use clustering on any type of visualization ranging from scatter plots to text tables and even maps. In our previous blog post โ€“ "Call Detail Record Analysis โ€“ K-means Clustering with R", we have discussed about CDR analysis using unsupervised K-means clustering algorithm. A daily activity file from Dandelion API is used as a data source, where the file contains CDR records generated by the Telecom Italia cellular network over the city of Milano.


Comparing Spectral partitioning / clustering (with Normalized Graph Laplacian) with KMeans Clustering in R

@machinelearnbot

The following simpler spectral partitioning approach (thresholding on the second dominant eigenvector) can also be applied for automatic separation of the foreground from the background.


K-Means & Other Clustering Algorithms: A Quick Intro with Python

@machinelearnbot

Clustering is the grouping of objects together so that objects belonging in the same group (cluster) are more similar to each other than those in other groups (clusters). In this intro cluster analysis tutorial, we'll check out a few algorithms in Python so you can get a basic understanding of the fundamentals of clustering on a real dataset. For the clustering problem, we will use the famous Zachary's Karate Club dataset. The story behind the data set is quite simple: There was a Karate Club that had an administrator "John A" and an instructor "Mr. Then a conflict arose between them, causing the students (Nodes) to split into two groups.


The Informativeness of $k$-Means and Dimensionality Reduction for Learning Mixture Models

arXiv.org Machine Learning

The learning of mixture models can be viewed as a clustering problem. Indeed, given data samples independently generated from a mixture of distributions, we often would like to find the correct target clustering of the samples according to which component distribution they were generated from. For a clustering problem, practitioners often choose to use the simple k-means algorithm. k-means attempts to find an optimal clustering which minimizes the sum-of-squared distance between each point and its cluster center. In this paper, we provide sufficient conditions for the closeness of any optimal clustering and the correct target clustering assuming that the data samples are generated from a mixture of log-concave distributions. Moreover, we show that under similar or even weaker conditions on the mixture model, any optimal clustering for the samples with reduced dimensionality is also close to the correct target clustering. These results provide intuition for the informativeness of k-means (with and without dimensionality reduction) as an algorithm for learning mixture models. We verify the correctness of our theorems using numerical experiments and demonstrate using datasets with reduced dimensionality significant speed ups for the time required to perform clustering.


Fast Approximate Spectral Clustering for Dynamic Networks

arXiv.org Machine Learning

Spectral clustering is a widely studied problem, yet its complexity is prohibitive for dynamic graphs of even modest size. We claim that it is possible to reuse information of past cluster assignments to expedite computation. Our approach builds on a recent idea of sidestepping the main bottleneck of spectral clustering, i.e., computing the graph eigenvectors, by using fast Chebyshev graph filtering of random signals. We show that the proposed algorithm achieves clustering assignments with quality approximating that of spectral clustering and that it can yield significant complexity benefits when the graph dynamics are appropriately bounded.