AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Approximation Algorithms for K-Modes Clustering

He, Zengyou

arXiv.org Artificial IntelligenceDec-1-2009

In this paper, we study clustering with respect to the k-modes objective function, a natural formulation of clustering for categorical data. One of the main contributions of this paper is to establish the connection between k-modes and k-median, i.e., the optimum of k-median is at most twice the optimum of k-modes for the same categorical data clustering problem. Based on this observation, we derive a deterministic algorithm that achieves an approximation factor of 2. Furthermore, we prove that the distance measure in k-modes defines a metric. Hence, we are able to extend existing approximation algorithms for metric k-median to k-modes. Empirical results verify the superiority of our method.

approximation algorithm, artificial intelligence, machine learning, (2 more...)

arXiv.org Artificial Intelligence

cs/0603120

Genre: Research Report (0.40)

Technology:

Information Technology > Mathematics of Computing (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.53)

Add feedback

K-ANMI: A Mutual Information Based Clustering Algorithm for Categorical Data

He, Zengyou, Xu, Xiaofei, Deng, Shengchun

arXiv.org Artificial IntelligenceDec-1-2009

Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we present k-ANMI, a new efficient algorithm for clustering categorical data. The k-ANMI algorithm works in a way that is similar to the popular k-means algorithm, and the goodness of clustering in each step is evaluated using a mutual information based criterion (namely, Average Normalized Mutual Information-ANMI) borrowed from cluster ensemble. Experimental results on real datasets show that k-ANMI algorithm is competitive with those state-of-art categorical data clustering algorithms with respect to clustering accuracy.

artificial intelligence, clustering algorithm, machine learning, (3 more...)

arXiv.org Artificial Intelligence

cs/0511013

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Add feedback

Attribute Value Weighting in K-Modes Clustering

He, Zengyou, Xu, Xaiofei, Deng, Shengchun

arXiv.org Artificial IntelligenceDec-1-2009

Categorical data clustering is an important research problem in pattern recognition and data mining. The k -modes algorithm [1] extends the k -means paradigm to cluster categorical data by using (1) a simple matching dissimilarity measure for categorical objects, (2) modes instead of means for clusters, and (3) a frequency-based method to update modes in the k -means fashion to minimize the cost function of clustering. The k -modes algorithm is widely used in real world applications due to its efficiency in dealing with large categorical database. In standard k -modes algorithm, a simple matching similarity measure is used, in which the distance is either 0 or 1. Such simple matching dissimilarity measure doesn't consider the implicit similarity relationship embedded in categorical values, which will result in a weaker intra-cluster similarity by allocating less similar objects to the cluster. To illustrate this fact, let's consider the following example shown in Fig.1. Example 1: In this artificial example, the dataset is described with 3 categorical attributes A1, A2,and A3, and there are two clusters with their modes. Assuming that we have to allocate a data object Y = [a, p, w] to either cluster 1 or cluster 2. According to the k -modes algorithm, we can assign Y to either cluster 1 or cluster 2 since these two clusters have the same mode. However, from the viewpoint of intra-cluster simila rity, it is more desirable to allocate Y to cluster 1.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

cs/0701013

Country: Asia > China (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Data spectroscopy: Eigenspaces of convolution operators and clustering

Shi, Tao, Belkin, Mikhail, Yu, Bin

arXiv.org Machine LearningNov-20-2009

This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.

artificial intelligence, eigenfunction, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1214/09-AOS700

0807.3719

Country: North America > United States > California (0.68)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

Add feedback

Learning Topology of Curves with Application to Clustering

Mobahi, Hossein (University of Illinois at Urbana Champaign) | Rao, Shankar (University of Illinois at Urbana Champaign) | Ma, Yi (University of Illinois at Urbana Champaign)

AAAI ConferencesNov-3-2009

We propose a method for learning the intrinsic topology of a point set sampled from a curve embedded in a high-dimensional ambient space. Our approach does not rely on distances in the ambient space, and thus can recover the topology of sparsely sampled curves, a situation where extant manifold learning methods are expected to fail. We formulate a loss function based on the smoothness of a curve, and derive a greedy procedure for minimizing this loss function. We compare the efficacy of our approach with representative manifold learning and hierarchical clustering methods on both real and synthetic data.

algorithm, isomap, manifold, (15 more...)

AAAI Conferences

2009 AAAI Fall Symposium Series

Country: North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Mesh Segmentation Using Laplacian Eigenvectors and Gaussian Mixtures

Sharma, Avinash (Perception Group, INRIA Grenoble Rhône-Alpes) | Horaud, Radu Patrice (Perception Group, INRIA Grenoble Rhône-Alpes) | Knossow, David (Perception Group, INRIA Grenoble Rhône-Alpes) | Lavante, Etienne von (Perception Group, INRIA Grenoble Rhône-Alpes)

AAAI ConferencesNov-3-2009

In this paper a new completely unsupervised mesh segmentation algorithm is proposed, which is based on the PCA interpretation of the Laplacian eigenvectors of the mesh and on parametric clustering using Gaussian mixtures. We analyse the geometric properties of these vectors and we devise a practical method that combines single-vector analysis with multiple-vector analysis. We attempt to characterize the projection of the graph onto each one of its eigenvectors based on PCA properties of the eigenvectors. We devise an unsupervised probabilistic method, based on one-dimensional Gaussian mixture modeling with model selection, to reveal the structure of each eigenvector. Based on this structure, we select a subset of eigenvectors among the set of the smallest non-null eigenvectors and we embed the mesh into the isometric space spanned by this selection of eigenvectors. The final clustering is performed via unsupervised classification based on learning a multi-dimensional Gaussian mixture model of the embedded graph.

artificial intelligence, eigenvector, machine learning, (18 more...)

AAAI Conferences

2009 AAAI Fall Symposium Series

Country:

Asia > Middle East > Jordan (0.05)
Europe > Poland (0.05)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.96)

Add feedback

MiPPS: A Generative Model for Multi-Manifold Clustering

Koyejo, Oluwasanmi (University of Texas at Austin) | Ghosh, Joydeep (University of Texas at Austin)

AAAI ConferencesNov-3-2009

We propose a generative model for high dimensional data consisting of intrinsically low dimensional clusters that are noisily sampled. The proposed model is a mixture of probabilistic principal surfaces (MiPPS) optimized using expectation maximization. We use a Bayesian prior on the model parameters to maximize the corresponding marginal likelihood. We also show empirically that this optimization can be biased towards a good local optimum by using our prior intuition to guide the initialization phase The proposed unsupervised algorithm naturally handles cases where the data lies on multiple connected components of a single manifold and where the component manifolds intersect. In addition to clustering, we learn a functional model for the underlying structure of each component cluster as a parameterized hyper-surface in ambient noise.This model is used to learn a global embedding that we use for visualization of the entire dataset. We demonstrate the performance of MiPPS in separating and visualizing land cover types in a hyperspectral dataset.

artificial intelligence, machine learning, manifold, (19 more...)

AAAI Conferences

2009 AAAI Fall Symposium Series

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Africa > Botswana (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Initialization Free Graph Based Clustering

Galluccio, Laurent, Michel, Olivier J. J., Comon, Pierre, Slezak, Eric, Hero, Alfred O.

arXiv.org Machine LearningSep-24-2009

This paper proposes an original approach to cluster multi-component data sets, including an estimation of the number of clusters. From the construction of a minimal spanning tree with Prim's algorithm, and the assumption that the vertices are approximately distributed according to a Poisson distribution, the number of clusters is estimated by thresholding the Prim's trajectory. The corresponding cluster centroids are then computed in order to initialize the generalized Lloyd's algorithm, also known as $K$-means, which allows to circumvent initialization problems. Some results are derived for evaluating the false positive rate of our cluster detection algorithm, with the help of approximations relevant in Euclidean spaces. Metrics used for measuring similarity between multi-dimensional data points are based on symmetrical divergences. The use of these informational divergences together with the proposed method leads to better results, compared to other clustering methods for the problem of astrophysical data processing. Some applications of this method in the multi/hyper-spectral imagery domain to a satellite view of Paris and to an image of the Mars planet are also presented. In order to demonstrate the usefulness of divergences in our problem, the method with informational divergence as similarity measure is compared with the same method using classical metrics. In the astrophysics application, we also compare the method with the spectral clustering algorithms.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

0909.4395

Country:

Europe > France (0.68)
North America > United States > California (0.46)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions

Arias-Castro, Ery

arXiv.org Machine LearningSep-12-2009

In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guaranties for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method.

artificial intelligence, probability, upstream oil & gas, (17 more...)

arXiv.org Machine Learning

0909.2353

Country:

Asia > Middle East > Jordan (0.24)
North America > United States > New York (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Diego County (0.14)

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Ontology-Based Link Prediction in the LiveJournal Social Network

Caragea, Doina (Kansas State University) | Bahirwani, Vikas (Kansas State University) | Aljandal, Waleed (Kansas State University) | Hsu, William H. (Kansas State University)

AAAI ConferencesSep-1-2009

LiveJournal is a social network journal service with focus on user interactions. As for many other online social networks, predicting potential friendships in the LiveJournal network is a problem of great practical interest. Previous work has shown that graph features extracted from the graph associated with the network are good predictors for friendship links. However, contrary to the intuition, user data (e.g., interests shared by two users) does not always improve the predictions obtained with graph features alone. This could be due to the fact that features constructed from a large number of user declared interests cannot capture the implicit semantic of the interests. To test this hypothesis, we use a clustering approach to build an interest ontology, and explore the ability of the ontology to improve the performance of learning algorithms at predicting friendship links, when interest-based features are used alone or in combination with graph-based features. The results show that ontology-based features can help improve the performance of several machine learning classiﬁers (in particular, random forest classiﬁers) at the task of predicting links in the LiveJournal social network.

algorithm, classifier, ontology, (14 more...)

AAAI Conferences

Eighth Symposium on Abstraction, Reformulation, and Approximation

Country:

Oceania > New Zealand > North Island > Waikato (0.04)
North America > United States > Kansas > Riley County > Manhattan (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Services (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback