AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Clustering processes

Ryabko, Daniil

arXiv.org Machine LearningApr-30-2013

The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. (again, no parametric or independence assumptions). In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1005.0826

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Add feedback

On latent position inference from doubly stochastic messaging activities

Lee, Nam H., Yoder, Jordan, Tang, Minh, Priebe, Carey E

arXiv.org Machine LearningApr-25-2013

We model messaging activities as a hierarchical doubly stochastic point process with three main levels, and develop an iterative algorithm for inferring actors' relative latent positions from a stream of messaging activity data. Each of the message-exchanging actors is modeled as a process in a latent space. The actors' latent positions are assumed to be influenced by the distribution of a much larger population over the latent space. Each actor's movement in the latent space is modeled as being governed by two parameters that we call confidence and visibility, in addition to dependence on the population distribution. The messaging frequency between a pair of actors is assumed to be inversely proportional to the distance between their latent positions. Our inference algorithm is based on a projection approach to an online filtering problem. The algorithm associates each actor with a probability density-valued process, and each probability density is assumed to be a mixture of basis functions. For efficient numerical experiments, we further develop our algorithm for the case where the basis functions are obtained by translating and scaling a standard Gaussian density.

actor, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

1205.592

Country: North America > United States (0.93)

Genre: Research Report (1.00)

Technology:

Information Technology > Communications (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

The K-modes algorithm for clustering

Carreira-Perpiñán, Miguel Á., Wang, Weiran

arXiv.org Machine LearningApr-23-2013

Many clustering algorithms exist that estimate a cluster centroid, such as K-means, K-medoids or mean-shift, but no algorithm seems to exist that clusters data by returning exactly K meaningful modes. We propose a natural definition of a K-modes objective function by combining the notions of density and cluster assignment. The algorithm becomes K-means and K-medoids in the limit of very large and very small scales. Computationally, it is slightly slower than K-means but much faster than mean-shift or K-medoids. Unlike K-means, it is able to find centroids that are valid patterns, truly representative of a cluster, even with nonconvex clusters, and appears robust to outliers and misspecification of the scale and number of clusters.

artificial intelligence, centroid, machine learning, (18 more...)

arXiv.org Machine Learning

1304.6478

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.55)

Add feedback

Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data

Savage, Richard S., Ghahramani, Zoubin, Griffin, Jim E., Kirk, Paul, Wild, David L.

arXiv.org Machine LearningApr-15-2013

We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$) but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that showly a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Machine Learning

1304.3577

Country: Europe > United Kingdom (0.68)

Genre: Research Report > Experimental Study (0.70)

Industry:

Health & Medicine > Therapeutic Area > Oncology > Childhood Cancer (0.84)
Health & Medicine > Therapeutic Area > Oncology > Brain Cancer (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Add feedback

Towards more accurate clustering method by using dynamic time warping

Ghanem, Khadoudja

arXiv.org Machine LearningApr-12-2013

An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time grows as the size and complexity of the training dataset increases. For this reason, it is important to have efficient computational methods and algorithms that can be applied on large datasets, such that it is still possible to complete the machine learning tasks in reasonable time. In this context, we present in this paper a more accurate simple process to speed up ML methods. An unsupervised clustering algorithm is combined with Expectation, Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The idea of the proposed process consists of two steps. In the first step, training instances with similar inputs are clustered and a weight factor which represents the frequency of these instances is assigned to each representative cluster. Dynamic Time Warping technique is used as a dissimilarity function to cluster similar examples. In the second step, all formulas in the classical HMM training algorithm (EM) associated with the number of training instances are modified to include the weight factor in appropriate terms. This process significantly accelerates HMM training while maintaining the same initial, transition and emission probabilities matrixes as those obtained with the classical HMM training algorithm. Accordingly, the classification accuracy is preserved. Depending on the size of the training set, speedups of up to 2200 times is possible when the size is about 100.000 instances. The proposed approach is not limited to training HMMs, but it can be employed for a large variety of MLs methods.

algorithm, mining & knowledge management process, sequence, (14 more...)

arXiv.org Machine Learning

doi: 10.5121/ijdkp.2013.3207

1304.3745

Country:

North America > United States (0.04)
North America > Canada (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(7 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Detecting Overlapping Temporal Community Structure in Time-Evolving Networks

Chen, Yudong, Kawadia, Vikas, Urgaonkar, Rahul

arXiv.org Machine LearningMar-28-2013

We present a principled approach for detecting overlapping temporal community structure in dynamic networks. Our method is based on the following framework: find the overlapping temporal community structure that maximizes a quality function associated with each snapshot of the network subject to a temporal smoothness constraint. A novel quality function and a smoothness constraint are proposed to handle overlaps, and a new convex relaxation is used to solve the resulting combinatorial optimization problem. We provide theoretical guarantees as well as experimental results that reveal community structure in real and synthetic networks. Our main insight is that certain structures can be identified only when temporal correlation is considered and when communities are allowed to overlap. In general, discovering such overlapping temporal community structure can enhance our understanding of real-world complex networks by revealing the underlying stability behind their seemingly chaotic evolution.

artificial intelligence, machine learning, overlap, (17 more...)

arXiv.org Machine Learning

1303.7226

Country:

South America (1.00)
Oceania (1.00)
North America (1.00)
(3 more...)

Genre: Research Report (0.82)

Industry: Government (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Machine Learning, Clustering, and Polymorphy

Hanson, Stephen Jose, Bauer, Malcolm

arXiv.org Artificial IntelligenceMar-27-2013

This paper describes a machine induction program (WITT) that attempts to model human categorization. Properties of categories to which human subjects are sensitive includes best or prototypical members, relative contrasts between putative categories, and polymorphy (neither necessary or sufficient features). This approach represents an alternative to usual Artificial Intelligence approaches to generalization and conceptual clustering which tend to focus on necessary and sufficient feature rules, equivalence classes, and simple search and match schemes. WITT is shown to be more consistent with human categorization while potentially including results produced by more traditional clustering schemes. Applications of this approach in the domains of expert systems and information retrieval are also discussed.

category, machine learning, simulation of human behavior, (18 more...)

arXiv.org Artificial Intelligence

1304.3432

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.36)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)
Information Technology > Artificial Intelligence > Cognitive Science > Simulation of Human Behavior (0.34)

Add feedback

Generalizing k-means for an arbitrary distance matrix

Szalkai, Balázs

arXiv.org Machine LearningMar-24-2013

The original k-means clustering method works only if the exact vectors representing the data points are known. Therefore calculating the distances from the centroids needs vector operations, since the average of abstract data points is undefined. Existing algorithms can be extended for those cases when the sole input is the distance matrix, and the exact representing vectors are unknown. This extension may be named relational k-means after a notation for a similar algorithm invented for fuzzy clustering. A method is then proposed for generalizing k-means for scenarios when the data points have absolutely no connection with a Euclidean space.

artificial intelligence, distance matrix, machine learning, (18 more...)

arXiv.org Machine Learning

1303.6001

Country: North America > United States (0.30)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Dynamic Microcluster Chains in Microtext

Robinson, Jason R. (The MITRE Corporation) | Condon, Sherri Lee (The MITRE Corporation)

AAAI ConferencesMar-21-2013

Two features of microtext that challenge language processing tools are addressed in the context of linking messages in the emergency response domain. First, the effect of very short texts on several classifiers is estimated by comparing the results when classifiers are applied to the full text of news reports vs. only the headlines. These experiments demonstrate a decrease of 5 - 20% in accuracy. A second challenging feature of microtexts is their accumulation in real time, which can be massive for sources such as Twitter. A dynamic hierarchical clustering algorithm that clusters messages as they accumulate is described, and a preliminary experiment in clustering tweets is demonstrated.

dynamic microcluster chain, microtext

AAAI Conferences

2013 AAAI Spring Symposium Series

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.53)

Add feedback

Warped Mixtures for Nonparametric Cluster Shapes

Iwata, Tomoharu, Duvenaud, David, Ghahramani, Zoubin

arXiv.org Machine LearningMar-21-2013

A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.

artificial intelligence, latent coordinate, machine learning, (16 more...)

arXiv.org Machine Learning

1206.1846

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback