

Crowdsourced clustering via active querying

AIHub

For more details, please read our full paper, Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees, from HCOMP 2023.


Hierarchical Clustering: A Practical Introduction of Agglomerative and Divisive Methods

#artificialintelligence

In this article, we discuss hierarchical clustering in detail: why we need it, how it works, which types exist, and which datasets it suits. Before moving on to hierarchical clustering, we should ask why we need it at all when we already have K-Means clustering. If you have studied K-Means, you know that it builds clusters by assigning each point to its nearest centroid. This works well when the dataset has well-defined cluster boundaries and few outliers. In the picture above, K-Means works well, but on more complex datasets problems arise and K-Means does not work properly. As you can see in the picture below, K-Means fails to form sensible clusters.
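The distance-to-centroid idea mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard assign-then-update loop, not code from the article; the function names, the toy points, and the starting centroids are all assumptions made for the example.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            updated.append(old)  # an empty cluster keeps its old centroid
            continue
        updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated

def k_means(points, centroids, iterations=10):
    """Alternate assignment and update steps for a fixed number of rounds."""
    for _ in range(iterations):
        labels = assign_to_nearest(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return assign_to_nearest(points, centroids), centroids

# Hypothetical toy data: two compact, well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

Because every point is pulled toward a single mean, this scheme favors compact, roughly round groups, which is consistent with the article's remark that it struggles on more complex datasets.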


Introduction Hierarchical Clustering

#artificialintelligence

Clustering tries to find structure in data by creating groupings of data with similar characteristics. The most famous clustering algorithm is likely K-means, but there are a large number of ways to cluster observations. Hierarchical clustering is an alternative class of clustering algorithms that produces anywhere from 1 to n clusters, where n is the number of observations in the data set. As you go down the hierarchy from 1 cluster (containing all the data) to n clusters (each observation is its own cluster), the observations within each cluster become more and more similar to one another (almost always). There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).
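To make the bottom-up (agglomerative) direction concrete, here is a minimal sketch that starts with n singleton clusters and repeatedly merges the two closest ones until a single cluster remains. The choice of single linkage (distance between the closest pair of members) is an assumption made for illustration, as are the function names and the toy points; the article does not specify a linkage rule.

```python
import math

def single_linkage(a, b, points):
    """Cluster distance = smallest distance between any pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerative(points):
    """Return every level of the hierarchy, from n clusters down to 1.

    Each level is a list of clusters; each cluster is a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters that are closest under single linkage.
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: single_linkage(clusters[xy[0]], clusters[xy[1]], points),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

# Hypothetical toy data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
levels = agglomerative(points)
for level in levels:
    print(len(level), "clusters:", level)
```

Reading the printed levels from last to first gives the top-down view described above: one cluster containing everything, then progressively finer splits until each observation stands alone. A divisive algorithm would build the same kind of hierarchy starting from the other end.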


Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation

arXiv.org Machine Learning

Across a variety of scientific disciplines, sparse inverse covariance estimation is a popular tool for capturing the underlying dependency relationships in multivariate data. Unfortunately, most estimators are not scalable enough to handle the sizes of modern high-dimensional data sets (often on the order of terabytes), and assume Gaussian samples. To address these deficiencies, we introduce HP-CONCORD, a highly scalable optimization method for estimating a sparse inverse covariance matrix based on a regularized pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal gradient method uses a novel communication-avoiding linear algebra algorithm and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving parallel scalability on problems with up to ~819 billion parameters (1.28 million dimensions); even on a single node, HP-CONCORD demonstrates scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to estimate the underlying dependency structure of the brain from fMRI data, and use the result to identify functional regions automatically. The results show good agreement with a clustering from the neuroscience literature.


Must-Know: How to determine the most useful number of clusters?

@machinelearnbot

Editor's note: This post was originally included as an answer to a question posed in our 17 More Must-Know Data Science Interview Questions and Answers series earlier this year. The answer was thorough enough that it was deemed to deserve its own dedicated post. With supervised learning, the number of classes in a particular set of data is known outright, since each data instance is labeled as a member of a particular existing class. In the worst case, we can scan the class attribute and count up the number of unique entries which exist. With unsupervised learning, the idea of class attributes and explicit class membership does not exist; in fact, one of the dominant forms of unsupervised learning -- data clustering -- aims to approximate class membership by minimizing interclass instance similarity and maximizing intraclass similarity.


Must-Know: How to determine the most useful number of clusters?

@machinelearnbot

With unsupervised learning, the idea of class attributes and explicit class membership does not exist; in fact, one of the dominant forms of unsupervised learning -- data clustering -- aims to approximate class membership by minimizing interclass instance similarity and maximizing intraclass similarity. We will have a look at two particularly popular methods for attempting to answer this question: the elbow method and the silhouette method. It should be self-evident that, in order to plot this variance against varying numbers of clusters, varying numbers of clusters must be tested. The silhouette method measures the similarity of an object to its own cluster -- called cohesion -- when compared to other clusters -- called separation.
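The cohesion-versus-separation comparison can be sketched with the standard silhouette coefficient: for each point, a is its mean distance to the other members of its own cluster, b is its mean distance to the members of the nearest other cluster, and the score is (b - a) / max(a, b). The code below is a plain-Python illustration under that definition; the function name, the toy data, and the convention of scoring singleton clusters as 0 are assumptions for the example, and at least two clusters are required.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    scores = []
    for i, p in enumerate(points):
        # Cohesion: mean distance to the other members of the same cluster.
        own = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q, label in zip(points, labels) if label == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data with one sensible labeling and one poor labeling.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
print(silhouette_score(points, good))
print(silhouette_score(points, bad))
```

To choose a number of clusters this way, one would run the clustering for several candidate values, compute the mean silhouette for each resulting labeling, and prefer the value with the highest score.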

