AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

On the Minimax Misclassification Ratio of Hypergraph Community Detection

Chien, I, Lin, Chung-Yi, Wang, I-Hsiang

arXiv.org Machine Learning | Feb-3-2018

Community detection in hypergraphs is explored. Under a generative hypergraph model called "d-wise hypergraph stochastic block model" (d-hSBM) which naturally extends the Stochastic Block Model from graphs to d-uniform hypergraphs, the asymptotic minimax mismatch ratio is characterized. For proving the achievability, we propose a two-step polynomial time algorithm that achieves the fundamental limit. The first step of the algorithm is a hypergraph spectral clustering method which achieves partial recovery to a certain precision level. The second step is a local refinement method which leverages the underlying probabilistic model along with parameter estimation from the outcome of the first step. To characterize the asymptotic performance of the proposed algorithm, we first derive a sufficient condition for attaining weak consistency in the hypergraph spectral clustering step. Then, under the guarantee of weak consistency in the first step, we upper bound the worst-case risk attained in the local refinement step by an exponentially decaying function of the size of the hypergraph and characterize the decaying rate. For proving the converse, the lower bound of the minimax mismatch ratio is set by finding a smaller parameter space which contains the most dominant error events, inspired by the analysis in the achievability part. It turns out that the minimax mismatch ratio decays exponentially fast to zero as the number of nodes tends to infinity, and the rate function is a weighted combination of several divergence terms, each of which is the Rényi divergence of order 1/2 between two Bernoulli distributions. The Bernoulli distributions involved in the characterization of the rate function are those governing the random instantiation of hyperedges in d-hSBM. Experimental results on synthetic data validate our theoretical finding that the refinement step is critical in achieving the optimal statistical limit.
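The divergence term named in the abstract above has a simple closed form for two Bernoulli distributions with parameters $p$ and $q$: $D_{1/2}(P \| Q) = -2 \log\big(\sqrt{pq} + \sqrt{(1-p)(1-q)}\big)$. The sketch below only evaluates that standard formula; the function name is hypothetical, and the weights that combine these terms into the paper's rate function are not reproduced here.

```python
import math

def renyi_half_bernoulli(p, q):
    """Renyi divergence of order 1/2 between Bernoulli(p) and Bernoulli(q), in nats."""
    # Bhattacharyya coefficient: sum over both outcomes of sqrt(P(x) * Q(x)).
    bc = math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
    return -2.0 * math.log(bc)

# Identical distributions give (numerically) zero divergence;
# the order-1/2 divergence is symmetric in its two arguments.
d_same = renyi_half_bernoulli(0.3, 0.3)
d_far = renyi_half_bernoulli(0.1, 0.9)
d_far_swapped = renyi_half_bernoulli(0.9, 0.1)
```

The further apart the two hyperedge probabilities are, the larger this term becomes, which is consistent with the abstract's statement that the mismatch ratio decays exponentially at a rate built from such terms.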

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1802.00926

Country:

Europe (0.45)
North America > United States (0.28)

Genre: Workflow (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.66)

Call Detail Record Analysis – K-means Clustering with R

@machinelearnbot | Jan-30-2018, 23:34:41 GMT

From the above plot, it is evident that clusters 1, 7, and 9 have activity across all 24 hours and are the higher revenue-generating clusters. Clusters 1, 5, 7, 9, and 10 have activity during night hours. Cluster 5 has activity from 11.5 to 17 hours.

artificial intelligence, information, machine learning, (14 more...)

@machinelearnbot

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.91)

COBRA: A Fast and Simple Method for Active Clustering with Pairwise Constraints

Van Craenendonck, Toon, Dumancic, Sebastijan, Blockeel, Hendrik

arXiv.org Machine Learning | Jan-30-2018

Clustering is inherently ill-posed: there often exist multiple valid clusterings of a single dataset, and without any additional information a clustering system has no way of knowing which clustering it should produce. This motivates the use of constraints in clustering, as they allow users to communicate their interests to the clustering system. Active constraint-based clustering algorithms select the most useful constraints to query, aiming to produce a good clustering using as few constraints as possible. We propose COBRA, an active method that first over-clusters the data by running K-means with a $K$ that is intended to be too large, and subsequently merges the resulting small clusters into larger ones based on pairwise constraints. In its merging step, COBRA is able to keep the number of pairwise queries low by maximally exploiting constraint transitivity and entailment. We experimentally show that COBRA outperforms the state of the art in terms of clustering quality and runtime, without requiring the number of clusters in advance.
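The two phases described in the abstract above (over-cluster with K-means, then merge small clusters through pairwise queries) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names, the toy data, the choice of the first member as each small cluster's representative, and the query order are all assumptions made for illustration. Transitivity is exploited in the simplest possible way, by comparing each small cluster against only one representative per already-merged group.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers are distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

def merge_by_queries(labels, must_link):
    """Merge small clusters using pairwise must-link queries on representatives."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    reps = {lab: idx[0] for lab, idx in groups.items()}  # assumption: first member
    merged, queries = [], 0  # merged: list of lists of small-cluster labels
    for lab in groups:
        for group in merged:
            queries += 1
            # One query per existing group suffices because must-link is transitive.
            if must_link(reps[group[0]], reps[lab]):
                group.append(lab)
                break
        else:
            merged.append([lab])
    final = [0] * len(labels)
    for new_id, group in enumerate(merged):
        for lab in group:
            for i in groups[lab]:
                final[i] = new_id
    return final, queries

# Toy data: two well-separated squares; the oracle stands in for a human annotator.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]  # hidden labels, used only by the toy oracle
oracle = lambda i, j: truth[i] == truth[j]
small = kmeans(points, k=4)  # deliberately more clusters than needed
final, queries = merge_by_queries(small, oracle)
```

Note that the number of final clusters is never passed in; it emerges from the oracle's answers, which mirrors the abstract's point that the number of clusters is not required in advance.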

constraint, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1801.09955

Country:

North America > United States (0.28)
Europe > Belgium (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Weighted Community Detection and Data Clustering Using Message Passing

Shi, Cheng, Liu, Yanchen, Zhang, Pan

arXiv.org Machine Learning | Jan-29-2018

Grouping objects into clusters based on similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message passing algorithms and spectral algorithms proposed for the unweighted community detection problem, we develop a non-parametric method based on statistical physics, by mapping the problem to the Potts model at the critical temperature of the spin-glass transition and applying belief propagation to solve the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks and use extensive numerical experiments to illustrate the advantage of our method over existing algorithms. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, when the data was generated by mixture models in the sparse regime, we show that our method works to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method only needs several labels to work perfectly in classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations, which heavily reduce the computational complexity in dense networks but give almost the same performance as belief propagation.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1801.09829

Country:

Asia > China (0.28)
North America > United States > Massachusetts (0.14)

Genre: Research Report (0.50)

Industry: Energy > Oil & Gas (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Graph Based Analysis for Gene Segment Organization In a Scrambled Genome

Hajij, Mustafa, Jonoska, Nataša, Kukushkin, Denys, Saito, Masahico

arXiv.org Machine Learning | Jan-28-2018

DNA rearrangement processes recombine gene segments that are organized on the chromosome in a variety of ways. The segments can overlap, interleave, or one may be a subsegment of another. We use directed graphs to represent segment organizations on a given locus where contigs containing rearranged segments represent vertices and the edges correspond to the segment relationships. Using graph properties we associate a point in a higher dimensional Euclidean space to each graph such that cluster formations and analysis can be performed with methods from topological data analysis. The method is applied to a recently sequenced model organism Oxytricha trifallax, a species of ciliate with a highly scrambled genome that undergoes a massive rearrangement process after conjugation. The analysis shows some emerging star-like graph structures indicating that segments of a single gene can interleave, or even contain all of the segments from fifteen or more other genes in between its segments. We also observe that as many as six genes can have their segments mutually interleaving or overlapping.

gene segment organization, graph, vertex, (15 more...)

arXiv.org Machine Learning

1801.05922

Country:

North America > United States > Florida > Hillsborough County > Tampa (0.14)
North America > United States > Illinois > Champaign County > Champaign (0.04)
Asia > Japan (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Clustering based on the In-tree Graph Structure and Affinity Propagation

Qiu, Teng, Li, Yongjie

arXiv.org Machine Learning | Jan-28-2018

A recently proposed clustering method, called Nearest Descent (ND), can organize the whole dataset into a sparsely connected graph, called the In-tree. This ND-based In-tree structure proves able to reveal the clustering structure underlying the dataset, with one imperfection: the In-tree contains some undesired edges that need to be removed. Here, we propose an effective way to automatically remove the undesired edges in the In-tree by combining the In-tree structure with affinity propagation (AP). The key to the combination is to add edges between the reachable nodes in the In-tree before using AP to remove the undesired edges. The experiments on both synthetic and real datasets demonstrate the effectiveness of the proposed method.

dataset, in-tree, node, (16 more...)

arXiv.org Machine Learning

1501.04318

Country:

Asia > China > Sichuan Province > Chengdu (0.04)
Europe > Finland > North Karelia > Joensuu (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Marketing Analytics: Methods, Practice, Implementation, and Links to Other Fields

France, Stephen L., Ghose, Sanjoy

arXiv.org Machine Learning | Jan-28-2018

Marketing analytics is a diverse field, with both academic researchers and practitioners coming from a range of backgrounds including marketing, operations research, statistics, and computer science. This paper provides an integrative review at the boundary of these areas. The topics of visualization, segmentation, and class prediction are featured. Links between the disciplines are emphasized. For each of these topics, a historical overview is given, starting with initial work in the 1960s and carrying through to the present day. Recent innovations for modern large and complex "big data" sets are described. Practical implementation advice is given, along with a directory of open source R routines for implementing marketing analytics techniques.

data mining, machine learning, segmentation, (24 more...)

arXiv.org Machine Learning

1801.09185

Country:

Europe (1.00)
North America > United States > California (0.67)
Asia (0.67)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.67)

Industry:

Marketing (1.00)
Information Technology > Services (1.00)

Technology:

Information Technology > Enterprise Applications > Customer Relationship Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
(8 more...)

Covariance-based Dissimilarity Measures Applied to Clustering Wide-sense Stationary Ergodic Processes

Peng, Qidi, Rao, Nan, Zhao, Ran

arXiv.org Machine Learning | Jan-27-2018

We introduce a new unsupervised learning problem: clustering wide-sense stationary ergodic stochastic processes. A covariance-based dissimilarity measure and consistent algorithms are designed for clustering in offline and online data settings, respectively. We also suggest a formal criterion on the efficiency of dissimilarity measures, and discuss some approaches to improving the efficiency of clustering algorithms when they are applied to cluster particular types of processes, such as self-similar processes with wide-sense stationary ergodic increments. Clustering synthetic data sampled from fractional Brownian motions is provided as an example of application.
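To make the idea of a covariance-based dissimilarity concrete, here is a simplified sketch that compares two observed sample paths through their sample autocovariances. It is not the measure defined in the paper: the function names, the lag cutoff `max_lag`, and the `1 / (lag + 1) ** 2` weights are assumptions chosen only to keep the example short.

```python
def autocov(x, lag):
    """Biased sample autocovariance of sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n

def cov_dissimilarity(x, y, max_lag=3):
    """Weighted L1 gap between sample autocovariances at lags 0..max_lag."""
    # Assumed weights: lower lags count more, higher lags are damped.
    return sum(
        abs(autocov(x, lag) - autocov(y, lag)) / (lag + 1) ** 2
        for lag in range(max_lag + 1)
    )

# Toy sequences with the same variance but different correlation structure.
fast = [1.0, -1.0] * 10            # sign flips every step
slow = [1.0, 1.0, -1.0, -1.0] * 5  # sign flips every two steps
d_same = cov_dissimilarity(fast, fast)
d_diff = cov_dissimilarity(fast, slow)
```

Both toy sequences have mean zero and unit variance, so a comparison based on marginal statistics alone would not separate them; the gap appears only at nonzero lags, which is the kind of structure a covariance-based measure is meant to pick up.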

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

1801.09049

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Energy (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Multivariate normal mixture modeling, clustering and classification with the rebmix package

Nagode, Marko

arXiv.org Machine Learning | Jan-26-2018

The rebmix package provides R functions for random univariate and multivariate finite mixture model generation, estimation, clustering and classification. The paper is focused on multivariate normal mixture models with unrestricted variance-covariance matrices. The objective is to show how to generate datasets for a known number of components, numbers of observations and component parameters, how to estimate the number of components, component weights and component parameters, and how to predict cluster and class membership based upon a model trained by the REBMIX algorithm. The accompanying plotting, bootstrapping and other features of the package are covered as well. For demonstration purposes, a multivariate normal dataset with unrestricted variance-covariance matrices is studied.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Machine Learning

1801.08788

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Rademacher Complexity Bounds for a Penalized Multiclass Semi-Supervised Algorithm

Maximov, Yury, Amini, Massih-Reza, Harchaoui, Zaid

arXiv.org Machine Learning | Jan-25-2018

We propose Rademacher complexity bounds for multiclass classifiers trained with a two-step semi-supervised model. In the first step, the algorithm partitions the partially labeled data and then identifies dense clusters containing $\kappa$ predominant classes using the labeled training examples such that the proportion of their non-predominant classes is below a fixed threshold. In the second step, a classifier is trained by minimizing a margin empirical loss over the labeled training set and a penalization term measuring the inability of the learner to predict the $\kappa$ predominant classes of the identified clusters. The resulting data-dependent generalization error bound involves the margin distribution of the classifier, the stability of the clustering technique used in the first step, and Rademacher complexity terms corresponding to partially labeled training data. Our theoretical results exhibit convergence rates extending those proposed in the literature for the binary case, and experimental results on different multiclass classification problems show empirical evidence that supports the theory.

artificial intelligence, inductive learning, machine learning, (16 more...)

arXiv.org Machine Learning

1607.00567

Country:

North America > United States (1.00)
North America > Canada (0.93)
Europe (0.93)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)
