AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Information theoretic model validation for clustering

Buhmann, Joachim M.

arXiv.org Machine LearningJun-2-2010

Model selection in clustering requires (i) to specify a suitable clustering principle and (ii) to control the model order complexity by choosing an appropriate number of clusters depending on the noise level in the data. We advocate an information theoretic perspective where the uncertainty in the measurements quantizes the set of data partitionings and, thereby, induces uncertainty in the solution space of clusterings. A clustering model, which can tolerate a higher level of fluctuations in the measurements than alternative models, is considered to be superior provided that the clustering solution is equally informative. This tradeoff between \emph{informativeness} and \emph{robustness} is used as a model selection criterion. The requirement that data partitionings should generalize from one data set to an equally probable second data set gives rise to a new notion of structure induced information.

approximation, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

1006.0375

Country: North America > United States (0.69)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets

Murtagh, Fionn, Contreras, Pedro

arXiv.org Machine LearningMay-14-2010

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. "Structure" can be understood as symmetry and a range of symmetries are expressed by hierarchy. Such symmetries directly point to invariants, that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy here, including ultrametric topology, generalized ultrametric, linkages with lattices and other discrete algebraic structures and with p-adic number representations. By focusing on symmetries in data we have a powerful means of structuring and analyzing massive, high dimensional data stores. We illustrate the powerfulness of hierarchical clustering in case studies in chemistry and finance, and we provide pointers to other published case studies.

artificial intelligence, data mining, machine learning, (21 more...)

arXiv.org Machine Learning

1005.2638

Country:

North America > United States (0.46)
Europe > United Kingdom (0.28)

Genre: Overview (0.87)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Exploratory Analysis of Functional Data via Clustering and Optimal Segmentation

Hébrail, Georges, Hugueney, Bernard, Lechevallier, Yves, Rossi, Fabrice

arXiv.org Machine LearningApr-3-2010

We propose in this paper an exploratory analysis algorithm for functional data. The method partitions a set of functions into $K$ clusters and represents each cluster by a simple prototype (e.g., piecewise constant). The total number of segments in the prototypes, $P$, is chosen by the user and optimally distributed among the clusters via two dynamic programming algorithms. The practical relevance of the method is shown on two real world datasets.

artificial intelligence, machine learning, partition, (18 more...)

arXiv.org Machine Learning

doi: 10.1016/j.neucom.2009.11.022

1004.0456

Country:

Europe > France (0.28)
Europe > Belgium (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

From Frequency to Meaning: Vector Space Models of Semantics

Turney, P. D., Pantel, P.

Journal of Artificial Intelligence ResearchFeb-27-2010

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

Journal of Artificial Intelligence Research

doi: 10.1613/jair.2934

AI Access Foundation

10640

Journal of Artificial Intelligence Research

Country:

North America > United States > Ohio > Franklin County > Columbus (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(46 more...)

Genre: Overview (1.00)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area (0.93)
Government (0.67)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

Operator norm convergence of spectral clustering on level sets

Pelletier, Bruno, Pudlo, Pierre

arXiv.org Machine LearningFeb-11-2010

The aim of data clustering, or unsupervised classification, is to partition a data set into several homogeneous groups relatively separated one from each other with respect to a certain distance or notion of similarity. There exists an extensive literature on clustering methods, and we refer the reader to Anderberg [1973], Hartigan [1975], McLachlan and Peel [2000], Chapter 10 in Duda et al. [2000], and Chapter 14 in Hastie et al. [2001] for general materials on the subject. In particular, popular clustering algorithms, such as Gaussian mixture models or k-means, have proved useful in a number of applications, yet they suffer from some internal and computational limitations. Indeed, the parametric assumption at the core of mixture models may be too stringent, while the standard k-means algorithm fails at identifying complex shaped, possibly non-convex, clusters. The class of spectral clustering algorithms is presently emerging as a promising alternative, showing improved performance over classical clustering algorithms on several benchmark problems and applications; see e.g., Ng et al. [2002], von Luxburg [2007]. An overview of spectral clustering algorithms may be found in von Luxburg [2007], and connections with kernel methods are exposed in Fillipone et al. [2008]. The spectral clustering algorithm amounts at embedding the data into a feature space by using the eigenvectors of the similarity matrix in such a way that the clusters may be separated using simple rules, e.g. a separation by hyperplanes. The core component of the spectral clustering algorithm is therefore the similarity matrix, or certain normalizations of it, generally called graph Laplacian matrices; see Chung [1997]. Graph Laplacian matrices may be viewed as discrete versions of bounded operators between functional spaces.

algorithm, convergence, spectral, (16 more...)

arXiv.org Machine Learning

1002.2313

Country:

North America > United States > New York (0.04)
Europe > France > Occitanie > Hérault > Montpellier (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
(5 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Classifying the typefaces of the Gutenberg 42-line bible

Alabert, Aureli, Rangel, Luz Ma.

arXiv.org Machine LearningJan-31-2010

We have measured the dissimilarities among several printed characters of a single page in the Gutenberg 42-line bible and we prove statistically the existence of several different matrices from which the metal types where constructed. This is in contrast with the prevailing theory, which states that only one matrix per character was used in the printing process of Gutenberg's greatest work. The main mathematical tool for this purpose is cluster analysis, combined with a statistical test for outliers. We carry out the research with two letters, i and a. In the first case, an exact clustering method is employed; in the second, with more specimens to be classified, we resort to an approximate agglomerative clustering method. The results show that the letters form clusters according to their shape, with significant shape differences among clusters, and allow to conclude, with a very small probability of error, that indeed the metal types used to print them were cast from several different matrices. Mathematics Subject Classification: 62H30

arXiv.org Machine Learning

1002.014

Country:

North America > United States > New York (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

K-tree: Large Scale Document Clustering

De Vries, Christopher M., Geva, Shlomo

arXiv.org Artificial IntelligenceJan-6-2010

We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means it forms a hierarchy of clusters. It has been extended to address issues with sparse representations. We compare performance and quality to CLUTO using document collections. The K-tree has a low time complexity that is suitable for large document collections. This tree structure allows for efficient disk based implementations where space requirements exceed that of main memory.

artificial intelligence, k-tree, machine learning, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/1571941.1572094

1001.083

Country:

Oceania > Australia > Queensland > Brisbane (0.05)
Europe > Netherlands > South Holland > Delft (0.05)
North America > United States > Massachusetts > Suffolk County > Boston (0.05)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.96)

Add feedback

Document Clustering with K-tree

De Vries, Christopher M., Geva, Shlomo

arXiv.org Artificial IntelligenceJan-6-2010

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering. K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality. Document classification was completed using Support Vector Machines.

data mining, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-642-03761-0_43

1001.0827

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Illinois (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Graph Quantization

Jain, Brijnesh J., Obermayer, Klaus

arXiv.org Artificial IntelligenceJan-6-2010

Vector quantization(VQ) is a lossy data compression technique from signal processing, which is restricted to feature vectors and therefore inapplicable for combinatorial structures. This contribution presents a theoretical foundation of graph quantization (GQ) that extends VQ to the domain of attributed graphs. We present the necessary Lloyd-Max conditions for optimality of a graph quantizer and consistency results for optimal GQ design based on empirical distortion measures and stochastic optimization. These results statistically justify existing clustering algorithms in the domain of graphs. The proposed approach provides a template of how to link structural pattern recognition methods other than GQ to statistical pattern recognition.

artificial intelligence, graph, machine learning, (15 more...)

arXiv.org Artificial Intelligence

1001.0921

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Influence of graph construction on graph-based clustering measures

Maier, Markus, Luxburg, Ulrike V., Hein, Matthias

Neural Information Processing SystemsDec-31-2009

Graph clustering methods such as spectral clustering are defined for general weighted graphs. In machine learning, however, data often is not given in form of a graph, but in terms of similarity (or distance) values between points. In this case, first a neighborhood graph is constructed using the similarities between the points and then a graph clustering algorithm is applied to this graph. In this paper we investigate the influence of the construction of the similarity graph on the clustering results. We first study the convergence of graph clustering criteria such as the normalized cut (Ncut) as the sample size tends to infinity. We find that the limit expressions are different for different types of graph, for example the r-neighborhood graph or the k-nearest neighbor graph. In plain words: Ncut on a kNN graph does something systematically different than Ncut on an r-neighborhood graph! This finding shows that graph clustering criteria cannot be studied independently of the kind of graph they are applied to. We also provide examples which show that these differences can be observed for toy and real data already for rather small sample sizes.

graph, knn graph, r-neighborhood graph, (15 more...)

Neural Information Processing Systems

Country:

North America (0.14)
Europe > Germany > Saarland > Saarbrücken (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback