 Clustering


Mixtures of Shifted Asymmetric Laplace Distributions

arXiv.org Machine Learning

A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the general inverse Gaussian distribution. This approach is mathematically elegant and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated on both simulated and real data to illustrate clustering and classification applications. In these analyses, our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the popular Gaussian approach. This work, which marks an important step in the non-Gaussian model-based clustering and classification direction, concludes with discussion as well as suggestions for future work.


Optimal Time Bounds for Approximate Clustering

arXiv.org Machine Learning

Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. Thus we establish a tight time bound of Θ(nk) for the k-median problem for a wide range of values of k. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
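To make the two objectives in this abstract concrete, here is a minimal sketch of how the k-median and k-means costs of a candidate set of centers could be evaluated. It does not implement the successive sampling technique; the function names and the toy data are assumptions chosen for illustration only.

```python
import math

def kmedian_cost(points, centers):
    """Average Euclidean distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)

def kmeans_cost(points, centers):
    """Average squared Euclidean distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points) / len(points)

# Toy data (hypothetical): two pairs of points, one candidate center per pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]

median_cost = kmedian_cost(points, centers)
means_cost = kmeans_cost(points, centers)
```

Evaluating either cost compares each of the n points with each of the k centers, that is, n·k distance computations, which is the per-iteration O(nk) work the abstract attributes to the k-means heuristic.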


An Information-Theoretic External Cluster-Validity Measure

arXiv.org Machine Learning

In this paper we propose a measure of clustering quality or accuracy that is appropriate in situations where it is desirable to evaluate a clustering algorithm by somehow comparing the clusters it produces with "ground truth" consisting of classes assigned to the patterns by manual means or some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also has the characteristic of allowing clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. In cases where all clusterings to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that would be required to encode (compress) the class labels if both the encoder and decoder have free access to the cluster labels. To achieve this encoding, the estimated conditional probabilities of the class labels given the cluster labels must also be encoded. These estimated probabilities can be seen as a model for the class labels and their associated code length as a model cost.
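The equal-cluster-count case described above reduces to the empirical mutual information between cluster labels and class labels. The following is a minimal sketch of that quantity in bits; it omits the model-cost term the abstract mentions for clusterings of different sizes, and the function name and example labels are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information_bits(cluster_labels, class_labels):
    """Empirical mutual information (in bits) between two label sequences."""
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))  # counts of (cluster, class) pairs
    cluster_counts = Counter(cluster_labels)
    class_counts = Counter(class_labels)
    mi = 0.0
    for (k, c), n_kc in joint.items():
        # p(k, c) * log2( p(k, c) / (p(k) * p(c)) ), written with raw counts
        mi += (n_kc / n) * log2(n_kc * n / (cluster_counts[k] * class_counts[c]))
    return mi

classes = ["a", "a", "b", "b"]
perfect = mutual_information_bits([0, 0, 1, 1], classes)      # clusters match classes
uninformative = mutual_information_bits([0, 1, 0, 1], classes)  # clusters independent of classes
```

In this toy example the matching clustering yields one bit (the full entropy of two balanced classes), while the independent clustering yields zero bits.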


Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning

arXiv.org Artificial Intelligence

Many problems in sequential decision making and stochastic control have natural multiscale structure: sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure, particularly beyond a single level of abstraction, has remained a longstanding challenge. We describe a fast multiscale procedure for repeatedly compressing, or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of sub-problems at different scales is automatically determined. Coarsened MDPs are themselves independent, deterministic MDPs, and may be solved using existing algorithms. The multiscale representation delivered by this procedure decouples sub-tasks from each other and can lead to substantial improvements in convergence rates both locally within sub-problems and globally across sub-problems, yielding significant computational savings. A second fundamental aspect of this work is that these multiscale decompositions yield new transfer opportunities across different problems, where solutions of sub-tasks at different levels of the hierarchy may be amenable to transfer to new problems. Localized transfer of policies and potential operators at arbitrary scales is emphasized. Finally, we demonstrate compression and transfer in a collection of illustrative domains, including examples involving discrete and continuous state spaces. Keywords: Markov decision processes, hierarchical reinforcement learning, transfer, multiscale analysis.


Overlapping clustering based on kernel similarity metric

arXiv.org Machine Learning

Producing overlapping schemes is a major issue in clustering. Recently proposed overlapping methods rely on the search for an optimal covering and are based on different metrics, such as Euclidean distance and I-Divergence, used to measure closeness between observations. In this paper, we propose the use of another measure for overlapping clustering, based on a kernel similarity metric. We also estimate the number of overlapped clusters using the Gram matrix. Experiments on both the Iris and EachMovie datasets show the correctness of the estimated number of clusters and show that a measure based on a kernel similarity metric improves the precision, recall and f-measure in overlapping clustering.


Classification Recouvrante Basée sur les Méthodes à Noyau (Overlapping Clustering Based on Kernel Methods)

arXiv.org Machine Learning

The overlapping clustering problem is an important learning issue in which clusters are not mutually exclusive and each object may belong simultaneously to several clusters. This paper presents a kernel-based method that produces overlapping clusters in a high-dimensional feature space, using Mercer kernel techniques to improve the separability of input patterns. The proposed method, called OKM-K (Overlapping $k$-means based kernel method), extends the OKM (Overlapping $k$-means) method to produce overlapping schemes. Experiments are performed on an overlapping dataset, and the empirical results obtained with OKM-K outperform those obtained with OKM.
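Kernel-based clustering methods of this kind typically rely on the fact that distances in the feature space can be computed from kernel evaluations alone. The sketch below illustrates that general identity with a Gaussian (RBF) kernel; it is not the OKM-K algorithm, and the kernel choice, the `sigma` parameter, and all function names are assumptions.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, a standard example of a Mercer kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def feature_space_sq_distance(x, y, kernel=rbf_kernel):
    """Squared distance between the feature-space images of x and y:
    ||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def gram_matrix(data, kernel=rbf_kernel):
    """Matrix of all pairwise kernel values."""
    return [[kernel(a, b) for b in data] for a in data]

# Toy observations (hypothetical).
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
K = gram_matrix(data)
near = feature_space_sq_distance(data[0], data[1])
far = feature_space_sq_distance(data[0], data[2])
```

With the RBF kernel every point has unit norm in feature space, so the squared distance ranges from 0 for identical points toward 2 for points that are far apart in the input space.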


A LASSO-Penalized BIC for Mixture Model Selection

arXiv.org Machine Learning

A model-based clustering approach assumes that each component or some combination of components corresponds to a cluster. When fitting the model in (1), the main task is to decide the number of components G. Titterington et al. (1985), McLachlan and Basford (1988) and McLachlan and Peel (2002) extensively reviewed mixture models, with a focus on Gaussian mixture models. Fraley and Raftery (2002) presented a review of work on Gaussian mixtures with a focus on clustering, discriminant analysis, and density estimation. They discuss a family of Gaussian mixture models, which arises from the imposition of constraints upon an eigen-decomposition of the component covariance structure. The family of mixture models they discuss, known as MCLUST, is actually a subset of the Gaussian parsimonious clustering models (GPCMs) of Celeux and Govaert (1995). When using the MCLUST models, one must choose the appropriate member of the family, i.e., the covariance structure, in addition to deciding the number of components G. Ghahramani and Hinton (1997) introduced a mixture of factor analyzers model, which was further developed by Tipping and Bishop (1999) and McLachlan and Peel (2000).


Visualization and clustering by 3D cellular automata: Application to unstructured data

arXiv.org Artificial Intelligence

Given the limited performance of 2D cellular automata, both in terms of space as the number of documents increases and in terms of cluster visualization, our motivation was to experiment with these cellular automata in a higher dimension in order to observe the impact of dimension on the quality of the results. Textual data were represented by a vector model whose components are derived from a global weighting over the corpus, Term Frequency-Inverse Document Frequency (TF-IDF). The WordNet thesaurus was used to address the lemmatization of words, because the representation used in this study is the bag of words. Another, language-independent method used to represent textual records is that of n-grams. Several similarity measures were tested. To validate the classification, we used two assessment measures based on recall and precision (f-measure and entropy). The results are promising and support the idea of increasing the dimension to address the problem of the spatial arrangement of the classes. The results obtained in terms of class purity (i.e., the minimum value of entropy) show that, as the number of documents grows, the results are better for 3D cellular automata, which was not evident in the 2D case. In terms of spatial navigation, 3D cellular automata provide better visualization performance than 2D cellular automata.


Data Clustering via Principal Direction Gap Partitioning

arXiv.org Machine Learning

Data clustering has various applications in a wide variety of fields ranging from social and biological sciences, to business, statistics, information retrieval, machine learning and data mining. Clustering refers to the process of grouping data based only on information found in the data which describes its characteristics and relationships. Although humans are generally very good at discovering patterns and classifying objects, clustering algorithms are able to discern similarities in data even when humans are not [6]. The main focus of our research has been document clustering, but we will demonstrate that our methods also work nicely on scientific data. In this paper, we propose an adaptation of the clustering algorithm known as Principal Direction Divisive Partitioning (PDDP), developed by Daniel Boley in [2], which is based on Principal Components Analysis (PCA). PCA involves the eigenvector decomposition of a data covariance matrix, or equivalently a singular value decomposition (SVD) of a data matrix after mean centering. The name of our adaptation, Principal Direction Gap Partitioning (PDGP), borrows most of its name from PDDP as it follows many of the same steps that PDDP follows. The word "gap" replaces the word "divisive" in reference to how the algorithm splits data along natural gaps at each step. This concept will be further developed in the following sections, but it should be noted that PDGP is still a divisive algorithm in the same way that PDDP is.
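The equivalence between the covariance eigenvector and the SVD of the mean-centered data, and the idea of splitting along the principal direction, can be sketched as follows. The sign-based split follows PDDP as it is commonly described; the largest-gap rule is only one hypothetical reading of "natural gaps" and should not be taken as the PDGP rule itself. All function names and the synthetic data are assumptions.

```python
import numpy as np

def principal_direction(X):
    """Return the leading principal direction of the rows of X and the centered data."""
    centered = X - X.mean(axis=0)  # mean centering
    # Rows of vt are the right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0], centered

def sign_split(projections):
    """PDDP-style split: which side of the mean each projected point falls on."""
    return projections > 0

def largest_gap_split(projections):
    """Hypothetical gap rule: cut at the midpoint of the widest gap between sorted projections."""
    ordered = np.sort(projections)
    i = int(np.argmax(np.diff(ordered)))
    threshold = 0.5 * (ordered[i] + ordered[i + 1])
    return projections > threshold

# Synthetic data (hypothetical): two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])

direction, centered = principal_direction(X)

# Cross-check against the top eigenvector of the covariance matrix (equal up to sign).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
agreement = abs(float(direction @ eigvecs[:, -1]))

projections = centered @ direction
by_sign = sign_split(projections)
by_gap = largest_gap_split(projections)
```

On data this well separated both rules recover the same two groups; they would be expected to differ when the widest gap in the projections does not fall at the mean.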


Spectral Clustering: An empirical study of Approximation Algorithms and its Application to the Attrition Problem

arXiv.org Machine Learning

Spectral clustering is a now well-known method for clustering which utilizes the spectrum of the data similarity matrix to perform this separation. Since the method relies on solving an eigenvector problem, it is computationally expensive for large datasets. To overcome this constraint, approximation methods have been developed which aim to reduce running time while maintaining accurate classification. In this article, we summarize and experimentally evaluate several approximation methods for spectral clustering. From an applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one aims to distinguish, within a set of employees, those who are likely to voluntarily leave the company from those who are not. Our study sheds light on the empirical performance of existing approximate spectral clustering methods and shows the applicability of these methods in an important business optimization related problem. Clustering or cluster analysis addresses the problem of separating a set of objects into clusters so that objects within each cluster are more similar to each other than to objects in different clusters. The clustering problem has become ubiquitous in data mining and machine learning with applications ranging from image processing to bioinformatics. What one means by clustering, and the type of clustering desired, is application dependent. For example, one may wish to segment an image such as that in Figure 1 (a)-(b). In medical imaging, segmentation may aid in the identification of tumors, assist physicians in surgery and separate anatomical structures. Computer vision applications utilize clustering methods to identify foreign objects in surveillance images or detect road signs for computer-piloted vehicles. In statistical analysis, the objects to be clustered may represent individuals in a population viewed as a vector of personal attributes.
For example, we will consider the attrition problem: from a dataset of employees one wishes to identify which cluster of employees are likely to voluntarily leave the company and which are not.
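As a rough illustration of how the spectrum of a similarity matrix can separate data, the sketch below performs a two-way split using the second-smallest eigenvector of the unnormalized graph Laplacian. This is one common textbook variant, not necessarily any of the methods evaluated in the article; the Gaussian similarity, the `sigma` parameter, the function name, and the synthetic data are all assumptions.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Two-way split from the sign of the second-smallest Laplacian eigenvector."""
    # Pairwise squared Euclidean distances between rows of X.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))  # Gaussian similarity matrix
    np.fill_diagonal(W, 0.0)                    # no self-similarity edges
    L = np.diag(W.sum(axis=1)) - W              # unnormalized Laplacian L = D - W
    # eigh returns eigenvalues in ascending order; column 1 is the second smallest.
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1] >= 0

# Synthetic data (hypothetical): two groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels = spectral_bipartition(X)
```

The dense eigendecomposition here scales roughly cubically with the number of points, which is the kind of cost that the approximation methods discussed above aim to reduce.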