Goto

Collaborating Authors

 Clustering


Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

arXiv.org Machine Learning

Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm PAM, partitioning around medoids, also known as k-medoids. In Euclidean geometry the mean--as used in k-means--is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or even more complex distances. A key issue with PAM is, however, its high run time cost. In this paper, we propose modifications to the PAM algorithm where at the cost of storing O(k) additional values, we can achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. We also show how the CLARA and CLARANS algorithms benefit from this modification. In experiments on real data with k=100, we observed a 200 fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.


On The Equivalence of Tries and Dendrograms - Efficient Hierarchical Clustering of Traffic Data

arXiv.org Artificial Intelligence

The widespread use of GPS-enabled devices generates voluminous and continuous amounts of traffic data but analyzing such data for interpretable and actionable insights poses challenges. A hierarchical clustering of the trips has many uses such as discovering shortest paths, common routes and often traversed areas. However, hierarchical clustering typically has time complexity of $O(n^2 \log n)$ where $n$ is the number of instances, and is difficult to scale to large data sets associated with GPS data. Furthermore, incremental hierarchical clustering is still a developing area. Prefix trees (also called tries) can be efficiently constructed and updated in linear time (in $n$). We show how a specially constructed trie can compactly store the trips and further show this trie is equivalent to a dendrogram that would have been built by classic agglomerative hierarchical algorithms using a specific distance metric. This allows creating hierarchical clusterings of GPS trip data and updating this hierarchy in linear time. %we can extract a meaningful kernel and can also interpret the structure as clusterings of differing granularity as one progresses down the tree. We demonstrate the usefulness of our proposed approach on a real world data set of half a million taxis' GPS traces, well beyond the capabilities of agglomerative clustering methods. Our work is not limited to trip data and can be used with other data with a string representation.


Identification of Invariant Sensorimotor Structures as a Prerequisite for the Discovery of Objects

arXiv.org Artificial Intelligence

Perceiving the surrounding environment in terms of objects is useful for any general purpose intelligent agent. In this paper, we investigate a fundamental mechanism making object perception possible, namely the identification of spatio-temporally invariant structures in the sensorimotor experience of an agent. We take inspiration from the Sensorimotor Contingencies Theory to define a computational model of this mechanism through a sensorimotor, unsupervised and predictive approach. Our model is based on processing the unsupervised interaction of an artificial agent with its environment. We show how spatio-temporally invariant structures in the environment induce regularities in the sensorimotor experience of an agent, and how this agent, while building a predictive model of its sensorimotor experience, can capture them as densely connected subgraphs in a graph of sensory states connected by motor commands. Our approach is focused on elementary mechanisms, and is illustrated with a set of simple experiments in which an agent interacts with an environment. We show how the agent can build an internal model of moving but spatio-temporally invariant structures by performing a Spectral Clustering of the graph modeling its overall sensorimotor experiences. We systematically examine properties of the model, shedding light more globally on the specificities of the paradigm with respect to methods based on the supervised processing of collections of static images.


Semi-supervised clustering for de-duplication

arXiv.org Machine Learning

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and putting records corresponding to different physical entities into different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labelled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labelled $0$ and other edges being labelled $1$. Under the promise that the edge difference between $G$ and $G^*$ is "small", we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. [Ashtiani et. al, 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle, which answers whether two points belong to the same or different clusters. We further prove that even with access to a same-cluster oracle, the promise version is NP-Hard as long as the number queries to the oracle is not too large ($o(n)$ where $n$ is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mathcal F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.


Introducing a hybrid model of DEA and data mining in evaluating efficiency. Case study: Bank Branches

arXiv.org Artificial Intelligence

The banking industry is very important for an economic cycle of each country and provides some quality of services for us. With the advancement in technology and rapidly increasing of the complexity of the business environment, it has become more competitive than the past so that efficiency analysis in the banking industry attracts much attention in recent years. From many aspects, such analyses at the branch level are more desirable. Evaluating the branch performance with the purpose of eliminating deficiency can be a crucial issue for branch managers to measure branch efficiency. This work not only can lead to a better understanding of bank branch performance but also give further information to enhance managerial decisions to recognize problematic areas. To achieve this purpose, this study presents an integrated approach based on Data Envelopment Analysis (DEA), Clustering algorithms and Polynomial Pattern Classifier for constructing a classifier to identify a class of bank branches. First, the efficiency estimates of individual branches are evaluated by using the DEA approach. Next, when the range and number of classes were identified by experts, the number of clusters is identified by an agglomerative hierarchical clustering algorithm based on some statistical methods. Next, we divide our raw data into k clusters By means of self-organizing map (SOM) neural networks. Finally, all clusters are fed into the reduced multivariate polynomial model to predict the classes of data.


Deep clustering: On the link between discriminative models and K-means

arXiv.org Machine Learning

In the context of recent deep clustering studies, discriminative models dominate the literature and report the most competitive performances. These models learn a deep discriminative neural network classifier in which the labels are latent. Typically, they use multinomial logistic regression posteriors and parameter regularization, as is very common in supervised learning. It is generally acknowledged that discriminative objective functions (e.g., those based on the mutual information or the KL divergence) are more flexible than generative approaches (e.g., K-means) in the sense that they make fewer assumptions about the data distributions and, typically, yield much better unsupervised deep learning results. On the surface, several recent discriminative models may seem unrelated to K-means. This study shows that these models are, in fact, equivalent to K-means under mild conditions and common posterior models and parameter regularization. We prove that, for the commonly used logistic regression posteriors, maximizing the $L_2$ regularized mutual information via an approximate alternating direction method (ADM) is equivalent to a soft and regularized K-means loss. Our theoretical analysis not only connects directly several recent state-of-the-art discriminative models to K-means, but also leads to a new soft and regularized deep K-means algorithm, which yields competitive performance on several image clustering benchmarks.


Iterative Initial Centroid Search via Sampling for k-Means Clustering

#artificialintelligence

In this post, we will look at using an iterative approach to searching for a better set of initial centroids for k-means clustering, and will do so by performing this process on a sample of our full dataset. What do we mean by "better?" Since k-means clustering aims to converge on an optimal set of cluster centers (centroids) and cluster membership based on distance from these centroids via successive iterations, it is intuitive that the more optimal the positioning of these initial centroids, the fewer iterations of the k-means clustering algorithms will be required for convergence. Therefore, thinking about ways to find a better set of initial centroid positions is a valid approach to optimizing the k-means clustering process. What we will do differently, specifically, is to draw a sample of data from our full dataset, and run short runs of of the k-means clustering algorithm on it (not to convergence), short runs which will include, out of necessity, the centroid initialization process.


Unique Metric for Health Analysis with Optimization of Clustering Activity and Cross Comparison of Results from Different Approach

arXiv.org Machine Learning

In machine learning and data mining, Cluster analysis is one of the most widely used unsupervised learning technique. Philosophy of this algorithm is to find similar data items and group them together based on any distance function in multidimensional space. These methods are suitable for finding groups of data that behave in a coherent fashion. The perspective may vary for clustering i.e. the way we want to find similarity, some methods are based on distance such as K-Means technique and some are probability based, like GMM. Understanding prominent segment of data is always challenging as multidimension space does not allow us to have a look and feel of the distance or any visual context on the health of the clustering. While explaining data using clusters, the major problem is to tell how many cluster are good enough to explain the data. Generally basic descriptive statistics are used to estimate cluster behaviour like scree plot, dendrogram etc. We propose a novel method to understand the cluster behaviour which can be used not only to find right number of clusters but can also be used to access the difference of health between different clustering methods on same data. Our technique would also help to also eliminate the noisy variables and optimize the clustering result. keywords - Clustering, Metric, K-means, hierarchical clustering, silhoutte, clustering index, measures


Hierarchical clustering that takes advantage of both density-peak and density-connectivity

arXiv.org Machine Learning

This paper focuses on density-based clustering, particularly the Density Peak (DP) algorithm and the one based on density-connectivity DBSCAN; and proposes a new method which takes advantage of the individual strengths of these two methods to yield a density-based hierarchical clustering algorithm. Our investigation begins with formally defining the types of clusters DP and DBSCAN are designed to detect; and then identifies the kinds of distributions that DP and DBSCAN individually fail to detect all clusters in a dataset. These identified weaknesses inspire us to formally define a new kind of clusters and propose a new method called DC-HDP to overcome these weaknesses to identify clusters with arbitrary shapes and varied densities. In addition, the new method produces a richer clustering result in terms of hierarchy or dendrogram for better cluster structures understanding. Our empirical evaluation results show that DC-HDP produces the best clustering results on 14 datasets in comparison with 7 state-of-the-art clustering algorithms.


Top Machine Learning Algorithms You Should Know to Become a Data Scientist - DZone AI

#artificialintelligence

There are two ways to categorize Machine Learning algorithms you may come across in the field. Generally, both approaches are useful. However, we will focus in on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types. There are different ways an algorithm can model a problem as it relates to the interaction with the experience. However, it doesn't matter whatever we want to call the input data. Also, an algorithm is popular in Machine Learning and Artificial Intelligence textbooks. That is to first consider the learning styles that an algorithm can adapt. Generally, there are only a few main learning styles that a Machine Learning algorithm can have.