
 Clustering


Introduction to Cross-Entropy Clustering: The R Package CEC

arXiv.org Machine Learning

The R Package CEC performs clustering based on the cross-entropy clustering (CEC) method, which was recently developed using information theory. The main advantage of CEC is that it combines the speed and simplicity of $k$-means with the ability to use various Gaussian mixture models and to remove unnecessary clusters. In this work we present a practical tutorial on CEC based on the R Package CEC. Functions are provided that cover the whole clustering process.
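As a rough illustration of why a cross-entropy cost can both fit Gaussian clusters and discourage superfluous ones, the sketch below evaluates a one-dimensional Gaussian cost of the form $\sum_i p_i\,(-\ln p_i + \tfrac{1}{2}\ln(2\pi e\,\sigma_i^2))$. This formula is taken from the general CEC literature and is an assumption here, since the abstract does not state it. The function name `gaussian_cec_cost` and the toy data are hypothetical, and the code is written in Python rather than with the R package itself.

```python
import math

def gaussian_cec_cost(clusters):
    """1-D Gaussian cross-entropy cost of a partition (assumed form, see lead-in).

    Each cluster contributes p * (-ln p + 0.5 * ln(2*pi*e*var)), where p is the
    fraction of points in the cluster and var is its (nonzero) variance.
    """
    n_total = sum(len(c) for c in clusters)
    cost = 0.0
    for points in clusters:
        p = len(points) / n_total
        mean = sum(points) / len(points)
        var = sum((x - mean) ** 2 for x in points) / len(points)
        # -ln p charges for each extra cluster; the log-variance term rewards tight fits.
        cost += p * (-math.log(p) + 0.5 * math.log(2 * math.pi * math.e * var))
    return cost

# Toy data: two well-separated groups on a line.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
one_cluster = gaussian_cec_cost([data])
two_clusters = gaussian_cec_cost([data[:3], data[3:]])
print(one_cluster, two_clusters)
```

On this toy data the two-cluster partition has the lower cost, because the drop in the variance term outweighs the $-\ln p_i$ charge for the extra cluster.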


Time Series Clustering via Community Detection in Networks

arXiv.org Machine Learning

In this paper, we propose a technique for time series clustering using community detection in complex networks. First, we present a method to transform a set of time series into a network using different distance functions, where each time series is represented by a vertex and the most similar ones are connected. Then, we apply community detection algorithms to identify groups of strongly connected vertices (communities) and, consequently, time series clusters. We also comprehensively analyze the influence of various combinations of time series distance functions, network generation methods and community detection techniques on the clustering results. Experiments show that the proposed network-based approach achieves better results than the classic and up-to-date clustering techniques under consideration. Statistical tests confirm that the proposed method outperforms some classic clustering algorithms, such as $k$-medoids, diana, median-linkage and centroid-linkage, on various data sets. Interestingly, the proposed method can effectively detect shape patterns present in time series, owing to the topological structure of the underlying network constructed in the clustering process, whereas other techniques fail to identify such patterns. Moreover, the proposed method is robust enough to group time series that present similar patterns but with time shifts and/or amplitude variations. In summary, the main point of the proposed method is the transformation of time series from the time-space domain to the topological domain. Therefore, we hope that our approach contributes not only to time series clustering, but also to general time series analysis tasks.
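The pipeline described in this abstract (distance function, network construction, grouping of vertices) can be sketched as follows. This is a minimal sketch under stated assumptions: Euclidean distance and a $k$-nearest-neighbour network are just one of the possible combinations, and connected components are used as a crude stand-in for a real community detection algorithm, which the paper applies but which is not reproduced here. All function names and the toy series are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_network(series, k):
    """Undirected k-NN graph: each vertex is linked to its k most similar series."""
    n = len(series)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: euclidean(series[i], series[j]))
        for j in order[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def connected_components(adj):
    """Label vertices by connected component (stand-in for community detection)."""
    labels = {}
    current = 0
    for start in adj:
        if start in labels:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in labels:
                    labels[w] = current
                    stack.append(w)
        current += 1
    return [labels[i] for i in range(len(adj))]

# Toy data: three oscillating series and three flat series at a higher level.
series = [
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 0.9, 0.1, 1.1],
    [0.0, 1.1, 0.1, 0.9],
    [5.0, 5.0, 5.0, 5.0],
    [5.1, 4.9, 5.0, 5.1],
    [4.9, 5.0, 5.1, 5.0],
]
labels = connected_components(knn_network(series, k=2))
print(labels)
```

On real data the network is usually connected, so a proper community detection method would replace `connected_components`; the network construction step stays the same.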


Robust Subspace Clustering via Smoothed Rank Approximation

arXiv.org Machine Learning

Matrix rank minimization subject to affine constraints arises in many application areas, ranging from signal processing to machine learning. The nuclear norm is a convex relaxation for this problem that can recover the rank exactly under some restricted and theoretically interesting conditions. However, for many real-world applications, the nuclear norm approximation to the rank function can only produce a result far from the optimum. To seek a solution of higher accuracy than the nuclear norm, in this paper we propose a rank approximation based on the Logarithm-Determinant. We consider using this rank approximation for subspace clustering. Our framework can model different kinds of errors and noise. An effective optimization strategy is developed, with a theoretical guarantee of convergence to a stationary point. The proposed method gives promising results on face clustering and motion segmentation tasks compared to state-of-the-art subspace clustering algorithms.
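To make the contrast between the nuclear norm and a log-determinant surrogate concrete, the sketch below compares both on lists of singular values. It assumes the common surrogate $\log\det(I + X^{T}X) = \sum_i \log(1 + \sigma_i^2)$; the abstract does not give the exact form used in the paper, so this choice, like the function names, is an assumption made for illustration.

```python
import math

def rank(singular_values, tol=1e-10):
    """Number of singular values above a small tolerance."""
    return sum(1 for s in singular_values if s > tol)

def nuclear_norm(singular_values):
    """Nuclear norm: the sum of the singular values."""
    return sum(singular_values)

def logdet_surrogate(singular_values):
    """Assumed surrogate: log det(I + X^T X) = sum_i log(1 + sigma_i^2)."""
    return sum(math.log(1.0 + s * s) for s in singular_values)

# Two hypothetical spectra with the same rank but very different scales.
small = [1.0, 1.0, 0.0]
large = [100.0, 100.0, 0.0]
for sv in (small, large):
    print(rank(sv), nuclear_norm(sv), logdet_surrogate(sv))
```

Both spectra have rank 2, yet the nuclear norm grows by a factor of 100 between them while the log-determinant value grows far more slowly, which is the sense in which it tracks the rank more closely for large singular values.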


A model selection approach for clustering a multinomial sequence with non-negative factorization

arXiv.org Machine Learning

We consider the problem of clustering a sequence of multinomial observations by way of a model selection criterion. We propose a form of penalty term for the model selection procedure. Our approach subsumes both the conventional AIC and BIC criteria and extends them so that they also apply to a sequence of sparse multinomial observations, where even within the same cluster the number of multinomial trials may differ between observations. In addition, as a preliminary estimation step to maximum likelihood estimation and, more generally, to maximum $L_{q}$ estimation, we propose to use reduced rank projection in combination with non-negative factorization. We motivate our approach by showing that our model selection criterion and preliminary estimation step yield consistent estimates under simplifying assumptions. We also illustrate our approach through numerical experiments using real and simulated data.
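For orientation, the two conventional criteria mentioned in this abstract can both be written as $-2\ln L + c \cdot k$, with $c = 2$ for AIC and $c = \ln n$ for BIC. The sketch below shows only this standard family; the paper's own penalty term is not given in the abstract and is not reproduced. The candidate fits are made-up numbers used purely to exercise the functions.

```python
import math

def penalized_criterion(log_likelihood, n_params, penalty_per_param):
    """Generic criterion: -2 * log-likelihood + penalty * number of parameters."""
    return -2.0 * log_likelihood + penalty_per_param * n_params

def aic(log_likelihood, n_params):
    """AIC uses a constant penalty of 2 per parameter."""
    return penalized_criterion(log_likelihood, n_params, 2.0)

def bic(log_likelihood, n_params, n_obs):
    """BIC uses a penalty of ln(n) per parameter."""
    return penalized_criterion(log_likelihood, n_params, math.log(n_obs))

# Hypothetical fits: (number of clusters, log-likelihood, free parameters).
candidates = [(1, -520.0, 4), (2, -480.0, 9), (3, -478.0, 14)]
n_obs = 200
best_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
best_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_obs))[0]
print(best_aic, best_bic)
```

The model with the smallest criterion value is selected; a larger penalty per parameter, as in BIC for large $n$, favours fewer clusters.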


A variational approach to the consistency of spectral clustering

arXiv.org Machine Learning

This paper establishes the consistency of spectral approaches to data clustering. We consider clustering of point clouds obtained as samples of a ground-truth measure. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points they connect. We investigate the spectral convergence of both unnormalized and normalized graph Laplacians towards the appropriate operators in the continuum domain. We obtain sharp conditions on how the connectivity radius can be scaled with respect to the number of sample points for the spectral convergence to hold. We also show that the discrete clusters obtained via spectral clustering converge towards a continuum partition of the ground-truth measure. This continuum partition minimizes a functional describing the continuum analogue of the graph-based spectral partitioning. Our approach, based on variational convergence, is general and flexible.
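The discrete objects studied in this abstract can be sketched in a few lines: a graph built from a connectivity radius, its unnormalized Laplacian $L = D - W$, and a two-way partition read off from the sign of the eigenvector for the second-smallest eigenvalue. The sketch assumes 0/1 edge weights within the radius and uses a simple shifted power iteration in place of a proper eigensolver; the function names, the toy points and the radius are all hypothetical.

```python
import math

def radius_graph(points, radius):
    """Weight 1 between distinct points closer than the connectivity radius, else 0."""
    n = len(points)
    return [[1.0 if i != j and abs(points[i] - points[j]) <= radius else 0.0
             for j in range(n)] for i in range(n)]

def unnormalized_laplacian(w):
    """L = D - W, where D is the diagonal matrix of vertex degrees."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]

def fiedler_vector(lap, iters=200):
    """Power iteration on (shift*I - L), kept orthogonal to the constant vector."""
    n = len(lap)
    shift = 2.0 * max(lap[i][i] for i in range(n))  # bounds the largest eigenvalue of L
    v = [float(i) for i in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the trivial constant eigenvector
        v = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Toy point cloud on a line: two groups separated by a gap larger than the radius.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
lap = unnormalized_laplacian(radius_graph(points, radius=1.0))
vec = fiedler_vector(lap)
labels = [0 if x < 0 else 1 for x in vec]
print(labels)
```

The consistency question in the paper concerns how such partitions behave as the number of points grows and the radius shrinks; the sketch only shows the finite-sample construction.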


Spectral Clustering and Block Models: A Review And A New Algorithm

arXiv.org Machine Learning

Since its introduction in [15], spectral analysis of various matrices associated with groups has become one of the most widely used clustering techniques in statistics and machine learning. In the context of unlabeled graphs, a number of methods, all of which come under the broad heading of spectral clustering, have been proposed. These methods, based on spectral analysis of adjacency matrices or some derived matrix such as one of the Laplacians ([31], [28], [23], [29], [32]), have been studied in connection with their effectiveness in identifying members of blocks in exchangeable graph block models. In this paper, after introducing the methods and models, we review some of the literature.


Dimension reduction for model-based clustering

arXiv.org Machine Learning

We introduce a dimension reduction method for visualizing the clustering structure obtained from a finite mixture of Gaussian densities. Information on the dimension reduction subspace is obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. The proposed method aims at reducing the dimensionality by identifying a set of linear combinations of the original features, ordered by importance as quantified by the associated eigenvalues, which capture most of the cluster structure contained in the data. Observations may then be projected onto such a reduced subspace, thus providing summary plots which help to visualize the clustering structure. These plots can be particularly appealing in the case of high-dimensional data and noisy structure. The newly constructed variables capture most of the clustering information available in the data, and they can be further reduced to improve clustering performance. We illustrate the approach on both simulated and real data sets.


Universal Approximation of Edge Density in Large Graphs

arXiv.org Machine Learning

With the recent availability of large amounts of network data, such as the world wide web, social networks, phone call networks and science collaboration graphs [1], [2], there is renewed interest in the graph partitioning problem, especially in the automatic discovery of community structures in large networks [3], [4], [5]. Beyond clustering approaches, coclustering approaches aim at summarizing the relation between two entities in a many-to-many relationship. Such a relation can be represented as a graph, where the source and target vertices represent entities and the edges stand for relations between entities. A coclustering model provides a summary of a graph by grouping source vertices and target vertices. For example, in market analysis, the source vertices of the graph represent customers, the target vertices represent products, and there is one edge each time a customer has purchased a product. A coclustering model summarizes the dataset by grouping customers that have purchased approximately the same products and grouping products that have been purchased by approximately the same customers. Coclustering models have been applied to many other domains, such as information retrieval (the entities are documents and their words in a text corpus), web log analysis (cookies and their visited web pages), web structure analysis (web pages with hyperlinks between them) or telecommunication networks (the call detail records stand for the edges in a call graph between a caller and a called party). All these real-world graphs are directed multigraphs, meaning that two entities may be linked by multiple edges. We aim to summarize and discover insightful patterns in such graphs, using a method with the following desired properties: 1) Robustness, to avoid detecting spurious patterns in case of noisy data.
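The kind of summary a coclustering model provides can be illustrated with the market-analysis example from this abstract: once customers and products are grouped, the directed multigraph reduces to a small table of edge counts per pair of groups. In the sketch below the groupings are supplied by hand and the purchase log is invented; finding good groupings is the actual problem addressed by the paper and is not attempted here.

```python
from collections import Counter

def block_summary(edges, source_cluster, target_cluster):
    """Count edges between each (source cluster, target cluster) pair."""
    counts = Counter()
    for src, tgt in edges:
        counts[(source_cluster[src], target_cluster[tgt])] += 1
    return dict(counts)

# Hypothetical purchase log: one edge per purchase, so repeated pairs are multi-edges.
edges = [
    ("ann", "tea"), ("ann", "tea"), ("ann", "mug"),
    ("bob", "tea"), ("bob", "mug"),
    ("cat", "saw"), ("dan", "saw"), ("dan", "drill"),
]
# Hand-picked groupings of customers (sources) and products (targets).
source_cluster = {"ann": "A", "bob": "A", "cat": "B", "dan": "B"}
target_cluster = {"tea": "kitchen", "mug": "kitchen", "saw": "tools", "drill": "tools"}

summary = block_summary(edges, source_cluster, target_cluster)
print(summary)
```

Eight individual purchases collapse into two non-empty blocks, which is the compression a coclustering summary is meant to deliver.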


Unsupervised Learning in Genome Informatics

arXiv.org Machine Learning

With different genomes available, unsupervised learning algorithms are essential for learning genome-wide biological insights. In particular, the functional characterization of different genomes is essential for understanding life. In this book chapter, we review the state-of-the-art unsupervised learning algorithms for genome informatics, from DNA to microRNA. DNA (DeoxyriboNucleic Acid) is the basic component of genomes. A significant fraction of DNA regions (transcription factor binding sites) are bound by proteins (transcription factors) to regulate gene expression at different developmental stages in different tissues. To fully understand genetics, it is necessary for us to apply unsupervised learning algorithms to learn and infer those DNA regions. Here we review several unsupervised learning methods for deciphering the genome-wide patterns of those DNA regions. MicroRNA (miRNA), a class of small endogenous non-coding RNA (RiboNucleic Acid) species, regulates gene expression post-transcriptionally by forming imperfect base pairs with the target sites, primarily at the 3$'$ untranslated regions of the messenger RNAs. Since the 1993 discovery of the first miRNA let-7 in worms, a vast number of studies have been dedicated to characterizing the functional impact of miRNA in a network context in order to understand complex diseases such as cancer. Here we review several representative unsupervised learning frameworks for inferring miRNA regulatory networks by exploiting the static sequence-based information pertinent to prior knowledge of miRNA targeting and the dynamic information on miRNA activities implied by recently available large data compendia, which interrogate genome-wide expression profiles of miRNAs and/or mRNAs across various cell conditions.


IT-Dendrogram: A New Member of the In-Tree (IT) Clustering Family

arXiv.org Machine Learning

Previously, we proposed a physically-inspired method to organize data points into an effective in-tree (IT) structure, in which the underlying cluster structure of the dataset is well revealed. Although some edges in the IT structure need to be removed, such undesired edges are generally distinguishable from other edges and are thus easy to identify. For instance, when the IT structures for 2-dimensional (2D) datasets are presented graphically, the undesired edges can be easily spotted and interactively identified. In practice, however, many datasets do not lie in 2D Euclidean space, so their IT structures cannot be presented graphically. But if we can effectively map those IT structures into a visualized space in which the salient features of the undesired edges are preserved, then the undesired edges can still be identified visually in a visualization environment. Previously, this was achieved by our method called IT-map. The outstanding advantage of IT-map is that clusters can still be found even with the so-called crowding problem in the embedding. In this paper, we propose another method, called IT-Dendrogram, to achieve the same goal through an effective combination of the IT structure and the single link hierarchical clustering (SLHC) method. Like IT-map, IT-Dendrogram can effectively represent the IT structures in a visualization environment, but in another form, the dendrogram. IT-Dendrogram can serve as another visualization method for identifying the undesired edges in the IT structures and thus benefit IT-based clustering analysis. This was demonstrated on several datasets with different shapes, dimensions, and attributes. Unlike IT-map, IT-Dendrogram always avoids the crowding problem, which could help users perform more reliable cluster analysis in certain problems.
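The single link hierarchical clustering (SLHC) component mentioned in this abstract can be sketched on its own. The code below is a plain single-link agglomeration over raw points that records the height of each merge, which is the information a dendrogram displays. It does not construct the in-tree structure or combine the two, since the abstract does not describe how that is done; the function name and toy data are hypothetical.

```python
def single_link_merges(points, dist):
    """Return the merge heights of single-link agglomerative clustering."""
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster distance is the closest pair across clusters.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
        heights.append(d)
    return heights

# Toy data: two tight groups far apart on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
heights = single_link_merges(points, lambda x, y: abs(x - y))
print(heights)
```

In the resulting sequence of heights, the final merge is much taller than the earlier ones; in a dendrogram such a tall link is the visually obvious place to cut, which parallels how undesired edges are meant to stand out.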