Clustering
Approximation Algorithms for K-Modes Clustering
In this paper, we study clustering with respect to the k-modes objective function, a natural formulation of clustering for categorical data. One of the main contributions of this paper is to establish the connection between k-modes and k-median, i.e., the optimum of k-median is at most twice the optimum of k-modes for the same categorical data clustering problem. Based on this observation, we derive a deterministic algorithm that achieves an approximation factor of 2. Furthermore, we prove that the distance measure in k-modes defines a metric. Hence, we are able to extend existing approximation algorithms for metric k-median to k-modes. Empirical results verify the superiority of our method.
K-ANMI: A Mutual Information Based Clustering Algorithm for Categorical Data
He, Zengyou, Xu, Xiaofei, Deng, Shengchun
Clustering categorical data is an integral part of data mining and has attracted much attention recently. In this paper, we present k-ANMI, a new efficient algorithm for clustering categorical data. The k-ANMI algorithm works in a way that is similar to the popular k-means algorithm, and the goodness of clustering in each step is evaluated using a mutual information based criterion (namely, Average Normalized Mutual Information-ANMI) borrowed from cluster ensemble. Experimental results on real datasets show that k-ANMI algorithm is competitive with those state-of-art categorical data clustering algorithms with respect to clustering accuracy.
Attribute Value Weighting in K-Modes Clustering
He, Zengyou, Xu, Xaiofei, Deng, Shengchun
Categorical data clustering is an important research problem in pattern recognition and data mining. The k -modes algorithm [1] extends the k -means paradigm to cluster categorical data by using (1) a simple matching dissimilarity measure for categorical objects, (2) modes instead of means for clusters, and (3) a frequency-based method to update modes in the k -means fashion to minimize the cost function of clustering. The k -modes algorithm is widely used in real world applications due to its efficiency in dealing with large categorical database. In standard k -modes algorithm, a simple matching similarity measure is used, in which the distance is either 0 or 1. Such simple matching dissimilarity measure doesn't consider the implicit similarity relationship embedded in categorical values, which will result in a weaker intra-cluster similarity by allocating less similar objects to the cluster. To illustrate this fact, let's consider the following example shown in Fig.1. Example 1: In this artificial example, the dataset is described with 3 categorical attributes A1, A2,and A3, and there are two clusters with their modes. Assuming that we have to allocate a data object Y = [a, p, w] to either cluster 1 or cluster 2. According to the k -modes algorithm, we can assign Y to either cluster 1 or cluster 2 since these two clusters have the same mode. However, from the viewpoint of intra-cluster simila rity, it is more desirable to allocate Y to cluster 1.
Data spectroscopy: Eigenspaces of convolution operators and clustering
Shi, Tao, Belkin, Mikhail, Yu, Bin
This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the data spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real-world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
Learning Topology of Curves with Application to Clustering
Mobahi, Hossein (University of Illinois at Urbana Champaign) | Rao, Shankar (University of Illinois at Urbana Champaign) | Ma, Yi (University of Illinois at Urbana Champaign)
We propose a method for learning the intrinsic topology of a point set sampled from a curve embedded in a high-dimensional ambient space. Our approach does not rely on distances in the ambient space, and thus can recover the topology of sparsely sampled curves, a situation where extant manifold learning methods are expected to fail. We formulate a loss function based on the smoothness of a curve, and derive a greedy procedure for minimizing this loss function. We compare the efficacy of our approach with representative manifold learning and hierarchical clustering methods on both real and synthetic data.
Mesh Segmentation Using Laplacian Eigenvectors and Gaussian Mixtures
Sharma, Avinash (Perception Group, INRIA Grenoble Rhône-Alpes) | Horaud, Radu Patrice (Perception Group, INRIA Grenoble Rhône-Alpes) | Knossow, David (Perception Group, INRIA Grenoble Rhône-Alpes) | Lavante, Etienne von (Perception Group, INRIA Grenoble Rhône-Alpes)
In this paper a new completely unsupervised mesh segmentation algorithm is proposed, which is based on the PCA interpretation of the Laplacian eigenvectors of the mesh and on parametric clustering using Gaussian mixtures. We analyse the geometric properties of these vectors and we devise a practical method that combines single-vector analysis with multiple-vector analysis. We attempt to characterize the projection of the graph onto each one of its eigenvectors based on PCA properties of the eigenvectors. We devise an unsupervised probabilistic method, based on one-dimensional Gaussian mixture modeling with model selection, to reveal the structure of each eigenvector. Based on this structure, we select a subset of eigenvectors among the set of the smallest non-null eigenvectors and we embed the mesh into the isometric space spanned by this selection of eigenvectors. The final clustering is performed via unsupervised classification based on learning a multi-dimensional Gaussian mixture model of the embedded graph.
MiPPS: A Generative Model for Multi-Manifold Clustering
Koyejo, Oluwasanmi (University of Texas at Austin) | Ghosh, Joydeep (University of Texas at Austin)
We propose a generative model for high dimensional data consisting of intrinsically low dimensional clusters that are noisily sampled. The proposed model is a mixture of probabilistic principal surfaces (MiPPS) optimized using expectation maximization. We use a Bayesian prior on the model parameters to maximize the corresponding marginal likelihood. We also show empirically that this optimization can be biased towards a good local optimum by using our prior intuition to guide the initialization phase The proposed unsupervised algorithm naturally handles cases where the data lies on multiple connected components of a single manifold and where the component manifolds intersect. In addition to clustering, we learn a functional model for the underlying structure of each component cluster as a parameterized hyper-surface in ambient noise.This model is used to learn a global embedding that we use for visualization of the entire dataset. We demonstrate the performance of MiPPS in separating and visualizing land cover types in a hyperspectral dataset.
Initialization Free Graph Based Clustering
Galluccio, Laurent, Michel, Olivier J. J., Comon, Pierre, Slezak, Eric, Hero, Alfred O.
This paper proposes an original approach to cluster multi-component data sets, including an estimation of the number of clusters. From the construction of a minimal spanning tree with Prim's algorithm, and the assumption that the vertices are approximately distributed according to a Poisson distribution, the number of clusters is estimated by thresholding the Prim's trajectory. The corresponding cluster centroids are then computed in order to initialize the generalized Lloyd's algorithm, also known as $K$-means, which allows to circumvent initialization problems. Some results are derived for evaluating the false positive rate of our cluster detection algorithm, with the help of approximations relevant in Euclidean spaces. Metrics used for measuring similarity between multi-dimensional data points are based on symmetrical divergences. The use of these informational divergences together with the proposed method leads to better results, compared to other clustering methods for the problem of astrophysical data processing. Some applications of this method in the multi/hyper-spectral imagery domain to a satellite view of Paris and to an image of the Mars planet are also presented. In order to demonstrate the usefulness of divergences in our problem, the method with informational divergence as similarity measure is compared with the same method using classical metrics. In the astrophysics application, we also compare the method with the spectral clustering algorithms.
Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions
In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guaranties for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method.
Ontology-Based Link Prediction in the LiveJournal Social Network
Caragea, Doina (Kansas State University) | Bahirwani, Vikas (Kansas State University) | Aljandal, Waleed (Kansas State University) | Hsu, William H. (Kansas State University)
LiveJournal is a social network journal service with focus on user interactions. As for many other online social networks, predicting potential friendships in the LiveJournal network is a problem of great practical interest. Previous work has shown that graph features extracted from the graph associated with the network are good predictors for friendship links. However, contrary to the intuition, user data (e.g., interests shared by two users) does not always improve the predictions obtained with graph features alone. This could be due to the fact that features constructed from a large number of user declared interests cannot capture the implicit semantic of the interests. To test this hypothesis, we use a clustering approach to build an interest ontology, and explore the ability of the ontology to improve the performance of learning algorithms at predicting friendship links, when interest-based features are used alone or in combination with graph-based features. The results show that ontology-based features can help improve the performance of several machine learning classifiers (in particular, random forest classifiers) at the task of predicting links in the LiveJournal social network.