AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Histogram-Based Method for Effective Initialization of the K-Means Clustering Algorithm

Gingles, Caroline (Louisiana State University in Shreveport) | Celebi, M. Emre (Louisiana State University in Shreveport)

AAAI ConferencesMay-7-2014

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, this algorithm is highly sensitive to the initial selection of the cluster centers. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have superlinear complexity in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the order in which the data points are processed. These methods are generally unreliable in that the quality of their results is unpredictable. In this paper, we propose a linear, deterministic, and order-invariant initialization method based on multidimensional histograms. Experiments on a diverse collection of data sets from the UCI Machine Learning Repository demonstrate the superiority of our method over the well-known maximin method.

artificial intelligence, k-means clustering algorithm, machine learning, (2 more...)

AAAI Conferences

The Twenty-Seventh International Flairs Conference

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Ridge Fusion in Statistical Learning

Price, Bradley S., Geyer, Charles J., Rothman, Adam J.

arXiv.org Machine LearningMay-5-2014

We propose a penalized likelihood method to jointly estimate multiple precision matrices for use in quadratic discriminant analysis and model based clustering. A ridge penalty and a ridge fusion penalty are used to introduce shrinkage and promote similarity between precision matrix estimates. Block-wise coordinate descent is used for optimization, and validation likelihood is used for tuning parameter selection. Our method is applied in quadratic discriminant analysis and semi-supervised model based clustering.

artificial intelligence, machine learning, simulation, (13 more...)

arXiv.org Machine Learning

1310.3892

Country: North America > United States (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.56)

Add feedback

Piecewise regression mixture for simultaneous functional data clustering and optimal segmentation

Chamroukhi, Faicel

arXiv.org Machine LearningApr-30-2014

This paper introduces a novel mixture model-based approach for simultaneous clustering and optimal segmentation of functional data which are curves presenting regime changes. The proposed model consists in a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches for learning the model parameters. The former is an estimation approach and consists in maximizing the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm. A fuzzy partition of the curves in K clusters is then obtained at convergence by maximizing the posterior cluster probabilities. The latter however is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is the probabilistic version that generalizes the deterministic K-means-like algorithm proposed in H\'ebrail et al. (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

1312.6974

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Model Based Clustering of High-Dimensional Binary Data

Tang, Yang, Browne, Ryan P., McNicholas, Paul D.

arXiv.org Machine LearningApr-26-2014

We propose a mixture of latent trait models with common slope parameters (MCLT) for model-based clustering of high-dimensional binary data, a data type for which few established methods exist. Recent work on clustering of binary data, based on a $d$-dimensional Gaussian latent variable, is extended by incorporating common factor analyzers. Accordingly, our approach facilitates a low-dimensional visual representation of the clusters. We extend the model further by the incorporation of random block effects. The dependencies in each block are taken into account through block-specific parameters that are considered to be random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. Our approach is demonstrated on real and simulated data.

artificial intelligence, latent variable, machine learning, (17 more...)

arXiv.org Machine Learning

doi: 10.1016/j.csda.2014.12.009

1404.3174

Country:

North America > United States (0.93)
North America > Canada > Ontario (0.28)

Genre: Research Report (0.64)

Industry:

Government > Regional Government (0.93)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Add feedback

Classifying pairs with trees for supervised biological network inference

Schrynemackers, Marie, Wehenkel, Louis, Babu, M. Madan, Geurts, Pierre

arXiv.org Machine LearningApr-24-2014

Networks are ubiquitous in biology and computational approaches have been largely investigated for their inference. In particular, supervised machine learning methods can be used to complete a partially known network by integrating various measurements. Two main supervised frameworks have been proposed: the local approach, which trains a separate model for each network node, and the global approach, which trains a single model over pairs of nodes. Here, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of these two approaches for biological network inference. We first formalize the problem of network inference as classification of pairs, unifying in the process homogeneous and bipartite graphs and discussing two main sampling schemes. We then present the global and the local approaches, extending the later for the prediction of interactions between two unseen network nodes, and discuss their specializations to tree-based ensemble methods, highlighting their interpretability and drawing links with clustering techniques. Extensive computational experiments are carried out with these methods on various biological networks that clearly highlight that these methods are competitive with existing methods.

artificial intelligence, machine learning, prediction, (19 more...)

arXiv.org Machine Learning

1404.6074

Country: Europe (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Solution Path Clustering with Adaptive Concave Penalty

Marchetti, Yuliya, Zhou, Qing

arXiv.org Machine LearningApr-24-2014

Fast accumulation of large amounts of complex data has created a need for more sophisticated statistical methodologies to discover interesting patterns and better extract information from these data. The large scale of the data often results in challenging high-dimensional estimation problems where only a minority of the data shows specific grouping patterns. To address these emerging challenges, we develop a new clustering methodology that introduces the idea of a regularization path into unsupervised learning. A regularization path for a clustering problem is created by varying the degree of sparsity constraint that is imposed on the differences between objects via the minimax concave penalty with adaptive tuning parameters. Instead of providing a single solution represented by a cluster assignment for each object, the method produces a short sequence of solutions that determines not only the cluster assignment but also a corresponding number of clusters for each solution. The optimization of the penalized loss function is carried out through an MM algorithm with block coordinate descent. The advantages of this clustering algorithm compared to other existing methods are as follows: it does not require the input of the number of clusters; it is capable of simultaneously separating irrelevant or noisy observations that show no grouping pattern, which can greatly improve data interpretation; it is a general methodology that can be applied to many clustering problems. We test this method on various simulated datasets and on gene expression data, where it shows better or competitive performance compared against several clustering methods.

artificial intelligence, machine learning, noise, (17 more...)

arXiv.org Machine Learning

doi: 10.1214/14-EJS934

1404.6289

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (0.81)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Comparative study Between Fuzzy Clustering Algorithm and Hard Clustering Algorithm

Bora, Dibya Jyoti, Gupta, Dr. Anil Kumar

arXiv.org Artificial IntelligenceApr-24-2014

Data clustering is an important area of data mining. This is an unsupervised study where data of similar types are put into one cluster while data of another types are put into different cluster. Fuzzy C means is a very important clustering technique based on fuzzy logic. Also we have some hard clustering techniques available like K-means among the popular ones. In this paper a comparative study is done between Fuzzy clustering algorithm and hard clustering algorithm.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.14445/22312803/IJCTT-V10P119

1404.6059

Country: North America > United States > California (0.15)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Unsupervised Text Extraction from G-Maps

Adak, Chandranath

arXiv.org Artificial IntelligenceApr-24-2014

This paper represents an text extraction method from Google maps, GIS maps/images. Due to an unsupervised approach there is no requirement of any prior knowledge or training set about the textual and non-textual parts. Fuzzy CMeans clustering technique is used for image segmentation and Prewitt method is used to detect the edges. Connected component analysis and gridding technique enhance the correctness of the results. The proposed method reaches 98.5% accuracy level on the basis of experimental data sets.

machine learning, natural language, text extraction, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICHCI-IEEE.2013.6887782

1404.6075

Country: Asia > India > West Bengal (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback

Clustering via Mode Seeking by Direct Estimation of the Gradient of a Log-Density

Sasaki, Hiroaki, Hyvärinen, Aapo, Sugiyama, Masashi

arXiv.org Machine LearningApr-20-2014

Mean shift clustering finds the modes of the data probability density by identifying the zero points of the density gradient. Since it does not require to fix the number of clusters in advance, the mean shift has been a popular clustering algorithm in various application fields. A typical implementation of the mean shift is to first estimate the density by kernel density estimation and then compute its gradient. However, since good density estimation does not necessarily imply accurate estimation of the density gradient, such an indirect two-step approach is not reliable. In this paper, we propose a method to directly estimate the gradient of the log-density without going through density estimation. The proposed method gives the global solution analytically and thus is computationally efficient. We then develop a mean-shift-like fixed-point algorithm to find the modes of the density for clustering. As in the mean shift, one does not need to set the number of clusters in advance. We empirically show that the proposed clustering method works much better than the mean shift especially for high-dimensional data. Experimental results further indicate that the proposed method outperforms existing clustering methods.

artificial intelligence, gradient, machine learning, (16 more...)

arXiv.org Machine Learning

1404.5028

Country: Europe > Finland (0.15)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Hierarchical Quasi-Clustering Methods for Asymmetric Networks

Carlsson, Gunnar, Mémoli, Facundo, Ribeiro, Alejandro, Segarra, Santiago

arXiv.org Machine LearningApr-17-2014

This paper introduces hierarchical quasi-clustering methods, a generalization of hierarchical clustering for asymmetric networks where the output structure preserves the asymmetry of the input data. We show that this output structure is equivalent to a finite quasi-ultrametric space and study admissibility with respect to two desirable properties. We prove that a modified version of single linkage is the only admissible quasi-clustering method. Moreover, we show stability of the proposed method and we establish invariance properties fulfilled by it. Algorithms are further developed and the value of quasi-clustering analysis is illustrated with a study of internal migration within United States.

artificial intelligence, hierarchical quasi-clustering method, machine learning, (16 more...)

arXiv.org Machine Learning

1404.4655

Country: North America > United States > California (0.28)

Genre: Research Report (0.40)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance (0.93)
Energy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback