AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Bayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and Reducibility

Lee, Juho, Choi, Seungjin

arXiv.org Machine LearningJun-2-2015

Bayesian hierarchical clustering (BHC) is an agglomerative clustering method, where a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome and limit the scalability. In this paper we relax BHC into a non-probabilistic formulation, exploring small-variance asymptotics in conjugate-exponential models. We develop a novel clustering algorithm, referred to as relaxed BHC (RBHC), from the asymptotic limit of the BHC model that exhibits the scalability of distance-based agglomerative clustering algorithms as well as the flexibility of Bayesian nonparametric models. We also investigate the reducibility of the dissimilarity measure emerged from the asymptotic limit of the BHC model, allowing us to use scalable algorithms such as the nearest neighbor chain algorithm. Numerical experiments on both synthetic and real-world datasets demonstrate the validity and high performance of our method.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1501.0743

Country: Europe (1.00)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)

Add feedback

Formal Concept Analysis for Knowledge Discovery from Biological Data

Raza, Khalid

arXiv.org Artificial IntelligenceJun-1-2015

Due to rapid advancement in high-throughput techniques, such as microarrays and next generation sequencing technologies, biological data are increasing exponentially. The current challenge in computational biology and bioinformatics research is how to analyze these huge raw biological data to extract biologically meaningful knowledge. This review paper presents the applications of formal concept analysis for the analysis and knowledge discovery from biological data, including gene expression discretization, gene co-expression mining, gene expression clustering, finding genes in gene regulatory networks, enzyme/protein classifications, binding site classifications, and so on. It also presents a list of FCA-based software tools applied in biological domain and covers the challenges faced so far.

bioinformatics, data mining, machine learning, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1504/IJDMB.2017.10009312

1506.00366

Country:

Europe (0.93)
Asia > India (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Data Science > Data Mining > Knowledge Discovery (0.63)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Constrained 1-Spectral Clustering

Rangapuram, Syama Sundar, Hein, Matthias

arXiv.org Machine LearningMay-24-2015

An important form of prior information in clustering comes in form of cannot-link and must-link constraints. We present a generalization of the popular spectral clustering technique which integrates such constraints. Motivated by the recently proposed $1$-spectral clustering for the unconstrained problem, our method is based on a tight relaxation of the constrained normalized cut into a continuous optimization problem. Opposite to all other methods which have been suggested for constrained spectral clustering, we can always guarantee to satisfy all constraints. Moreover, our soft formulation allows to optimize a trade-off between normalized cut and the number of violated constraints. An efficient implementation is provided which scales to large datasets. We outperform consistently all other proposed methods in the experiments.

artificial intelligence, constraint, machine learning, (16 more...)

arXiv.org Machine Learning

1505.06485

Country:

North America > United States (0.28)
Europe > Germany > Saarland (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Tight Continuous Relaxation of the Balanced $k$-Cut Problem

Rangapuram, Syama Sundar, Mudrakarta, Pramod Kaushik, Hein, Matthias

arXiv.org Machine LearningMay-24-2015

Spectral Clustering as a relaxation of the normalized/ratio cut has become one of the standard graph-based clustering methods. Existing methods for the computation of multiple clusters, corresponding to a balanced $k$-cut of the graph, are either based on greedy techniques or heuristics which have weak connection to the original motivation of minimizing the normalized cut. In this paper we propose a new tight continuous relaxation for any balanced $k$-cut problem and show that a related recently proposed relaxation is in most cases loose leading to poor performance in practice. For the optimization of our tight continuous relaxation we propose a new algorithm for the difficult sum-of-ratios minimization problem which achieves monotonic descent. Extensive comparisons show that our method outperforms all existing approaches for ratio cut and other balanced $k$-cut criteria.

artificial intelligence, constraint, machine learning, (18 more...)

arXiv.org Machine Learning

1505.06478

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

A Mixture of Generalized Hyperbolic Factor Analyzers

Tortora, Cristina, McNicholas, Paul D., Browne, Ryan P.

arXiv.org Machine LearningMay-22-2015

Model-based clustering imposes a finite mixture modelling structure on data for clustering. Finite mixture models assume that the population is a convex combination of a finite number of densities, the distribution within each population is a basic assumption of each particular model. Among all distributions that have been tried, the generalized hyperbolic distribution has the advantage that is a generalization of several other methods, such as the Gaussian distribution, the skew t-distribution, etc. With specific parameters, it can represent either a symmetric or a skewed distribution. While its inherent flexibility is an advantage in many ways, it means the estimation of more parameters than its special and limiting cases. The aim of this work is to propose a mixture of generalized hyperbolic factor analyzers to introduce parsimony and extend the method to high dimensional data. This work can be seen as an extension of the mixture of factor analyzers model to generalized hyperbolic mixtures. The performance of our generalized hyperbolic factor analyzers is illustrated on real data, where it performs favourably compared to its Gaussian analogue.

artificial intelligence, machine learning, mcnichola, (16 more...)

arXiv.org Machine Learning

doi: 10.1007/s11634-015-0204-z

1311.653

Country:

Europe (1.00)
North America > Canada > Ontario (0.68)
North America > United States (0.67)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Feature Selection as State-Space Search: An Empirical Study in Clustering Problems

Mariño, Julian R. H. (Universidade Federal de Viçosa) | Lelis, Levi H. S. (Universidade Federal de Viçosa)

AAAI ConferencesMay-21-2015

In this paper we treat the problem of feature selection in unsupervised learning as a state-space search problem. We introduce three different heuristic functions and perform extensive experiments on datasets with tens, hundreds, and thousands of features. Namely, we test different search algorithms using the heuristic functions we introduce. Our results show that the heuristic search approach for feature selection in unsupervised learning problems can be far superior than traditional baselines such as PCA and random projections.

algorithm, constraint, heuristic function, (15 more...)

AAAI Conferences

Eighth Annual Symposium on Combinatorial Search

Country: South America > Brazil > Minas Gerais (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

An Experimental Comparison of Several Clustering and Initialization Methods

Meila, Marina, Heckerman, David

arXiv.org Machine LearningMay-16-2015

We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a "winner take all" version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic). We find that the EM algorithm significantly outperforms the other methods, and proceed to investigate the effect of various initialization schemes on the final solution produced by the EM algorithm. The initializations that we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of hierarchical agglomerative clustering. Although the methods are substantially different, they lead to learned models that are strikingly similar in quality.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1301.7401

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Add feedback

Ensembling classification models based on phalanxes of variables with applications in drug discovery

Tomal, Jabed H., Welch, William J., Zamar, Ruben H.

arXiv.org Machine LearningMay-15-2015

Statistical detection of a rare class of objects in a two-class classification problem can pose several challenges. Because the class of interest is rare in the training data, there is relatively little information in the known class response labels for model building. At the same time the available explanatory variables are often moderately high dimensional. In the four assays of our drug-discovery application, compounds are active or not against a specific biological target, such as lung cancer tumor cells, and active compounds are rare. Several sets of chemical descriptor variables from computational chemistry are available to classify the active versus inactive class; each can have up to thousands of variables characterizing molecular structure of the compounds. The statistical challenge is to make use of the richness of the explanatory variables in the presence of scant response information. Our algorithm divides the explanatory variables into subsets adaptively and passes each subset to a base classifier. The various base classifiers are then ensembled to produce one model to rank new objects by their estimated probabilities of belonging to the rare class of interest. The essence of the algorithm is to choose the subsets such that variables in the same group work well together; we call such groups phalanxes.

artificial intelligence, descriptor, machine learning, (17 more...)

arXiv.org Machine Learning

doi: 10.1214/14-AOAS778

1303.4805

Country:

North America > United States > California (0.46)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

An iterative step-function estimator for graphons

Cai, Diana, Ackerman, Nathanael, Freer, Cameron

arXiv.org Machine LearningMay-11-2015

Exchangeable graphs arise via a sampling procedure from measurable functions known as graphons. A natural estimation problem is how well we can recover a graphon given a single graph sampled from it. One general framework for estimating a graphon uses step-functions obtained by partitioning the nodes of the graph according to some clustering algorithm. We propose an iterative step-function estimator (ISFE) that, given an initial partition, iteratively clusters nodes based on their edge densities with respect to the previous iteration's partition. We analyze ISFE and demonstrate its performance in comparison with other graphon estimation techniques.

artificial intelligence, graphon, machine learning, (18 more...)

arXiv.org Machine Learning

1412.2129

Country: North America > United States > New York (0.28)

Genre: Research Report (0.82)

Industry:

Government > Military (0.46)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Relations Between Adjacency and Modularity Graph Partitioning

Jiang, Hansi, Meyer, Carl

arXiv.org Machine LearningMay-9-2015

In this paper the exact linear relation between the leading eigenvector of the unnormalized modularity matrix and the eigenvectors of the adjacency matrix is developed. Based on this analysis a method to approximate the leading eigenvector of the modularity matrix is given, and the relative error of the approximation is derived. A complete proof of the equivalence between normalized modularity clustering and normalized adjacency clustering is also given. A new metric is defined to describe the agreement of two clustering methods, and some applications and experiments are given to illustrate and corroborate the points that are made in the theoretical development.

artificial intelligence, machine learning, matrix, (18 more...)

arXiv.org Machine Learning

1505.03481

Country: North America > United States > North Carolina (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)

Add feedback