Goto

Collaborating Authors

 Clustering


A Study of the Spatio-Temporal Correlations in Mobile Calls Networks

arXiv.org Machine Learning

For the last few years, the amount of data has significantly increased in the companies. It is the reason why data analysis methods have to evolve to meet new demands. In this article, we introduce a practical analysis of a large database from a telecommunication operator. The problem is to segment a territory and characterize the retrieved areas owing to their inhabitant behavior in terms of mobile telephony. We have call detail records collected during five months in France. We propose a two stages analysis. The first one aims at grouping source antennas which originating calls are similarly distributed on target antennas and conversely for target antenna w.r.t. source antenna. A geographic projection of the data is used to display the results on a map of France. The second stage discretizes the time into periods between which we note changes in distributions of calls emerging from the clusters of source antennas. This enables an analysis of temporal changes of inhabitants behavior in every area of the country.


Robust Subspace Clustering via Tighter Rank Approximation

arXiv.org Machine Learning

Matrix rank minimization problem is in general NP-hard. The nuclear norm is used to substitute the rank function in many recent studies. Nevertheless, the nuclear norm approximation adds all singular values together and the approximation error may depend heavily on the magnitudes of singular values. This might restrict its capability in dealing with many practical problems. In this paper, an arctangent function is used as a tighter approximation to the rank function. We use it on the challenging subspace clustering problem. For this nonconvex minimization problem, we develop an effective optimization procedure based on a type of augmented Lagrange multipliers (ALM) method. Extensive experiments on face clustering and motion segmentation show that the proposed method is effective for rank approximation.


A Unified Framework for Representation-based Subspace Clustering of Out-of-sample and Large-scale Data

arXiv.org Machine Learning

Under the framework of spectral clustering, the key of subspace clustering is building a similarity graph which describes the neighborhood relations among data points. Some recent works build the graph using sparse, low-rank, and $\ell_2$-norm-based representation, and have achieved state-of-the-art performance. However, these methods have suffered from the following two limitations. First, the time complexities of these methods are at least proportional to the cube of the data size, which make those methods inefficient for solving large-scale problems. Second, they cannot cope with out-of-sample data that are not used to construct the similarity graph. To cluster each out-of-sample datum, the methods have to recalculate the similarity graph and the cluster membership of the whole data set. In this paper, we propose a unified framework which makes representation-based subspace clustering algorithms feasible to cluster both out-of-sample and large-scale data. Under our framework, the large-scale problem is tackled by converting it as out-of-sample problem in the manner of "sampling, clustering, coding, and classifying". Furthermore, we give an estimation for the error bounds by treating each subspace as a point in a hyperspace. Extensive experimental results on various benchmark data sets show that our methods outperform several recently-proposed scalable methods in clustering large-scale data set.


Fast Landmark Subspace Clustering

arXiv.org Machine Learning

Kernel methods obtain superb performance in terms of accuracy for various machine learning tasks since they can effectively extract nonlinear relations. However, their time complexity can be rather large especially for clustering tasks. In this paper we define a general class of kernels that can be easily approximated by randomization. These kernels appear in various applications, in particular, traditional spectral clustering, landmark-based spectral clustering and landmark-based subspace clustering. We show that for $n$ data points from $K$ clusters with $D$ landmarks, the randomization procedure results in an algorithm of complexity $O(KnD)$. Furthermore, we bound the error between the original clustering scheme and its randomization. To illustrate the power of this framework, we propose a new fast landmark subspace (FLS) clustering algorithm. Experiments over synthetic and real datasets demonstrate the superior performance of FLS in accelerating subspace clustering with marginal sacrifice of accuracy.


Fully adaptive density-based clustering

arXiv.org Machine Learning

The clusters of a distribution are often defined by the connected components of a density level set. However, this definition depends on the user-specified level. We address this issue by proposing a simple, generic algorithm, which uses an almost arbitrary level set estimator to estimate the smallest level at which there are more than one connected components. In the case where this algorithm is fed with histogram-based level set estimates, we provide a finite sample analysis, which is then used to show that the algorithm consistently estimates both the smallest level and the corresponding connected components. We further establish rates of convergence for the two estimation problems, and last but not least, we present a simple, yet adaptive strategy for determining the width-parameter of the involved density estimator in a data-depending way.


Spectral Convergence Rate of Graph Laplacian

arXiv.org Machine Learning

Laplacian Eigenvectors of the graph constructed from a data set are used in many spectral manifold learning algorithms such as diffusion maps and spectral clustering. Given a graph constructed from a random sample of a d-dimensional compact submanifold M in R D, we establish the spectral convergence rate of the graph Laplacian. It implies the consistency of the spectral clustering algorithm via a standard perturbation argument. A simple numerical study indicates the necessity of a denoising step before applying spectral algorithms. 1. Introduction High-dimensional data appears naturally in real-world applications. A common assumption is that the data resides on a low-dimensional manifold.


Adaptive Mixtures of Factor Analyzers

arXiv.org Machine Learning

A mixture of factor analyzers is a semi-parametric density estimator that generalizes the well-known mixtures of Gaussians model by allowing each Gaussian in the mixture to be represented in a different lower-dimensional manifold. This paper presents a robust and parsimonious model selection algorithm for training a mixture of factor analyzers, carrying out simultaneous clustering and locally linear, globally nonlinear dimensionality reduction. Permitting different number of factors per mixture component, the algorithm adapts the model complexity to the data complexity. We compare the proposed algorithm with related automatic model selection algorithms on a number of benchmarks. The results indicate the effectiveness of this fast and robust approach in clustering, manifold learning and class-conditional modeling.


Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions

arXiv.org Machine Learning

We propose a novel method for multiple clustering that assumes a co-clustering structure (partitions in both rows and columns of the data matrix) in each view. The new method is applicable to high-dimensional data. It is based on a nonparametric Bayesian approach in which the number of views and the number of feature-/subject clusters are inferred in a data-driven manner. We simultaneously model different distribution families, such as Gaussian, Poisson, and multinomial distributions in each cluster block. This makes our method applicable to datasets consisting of both numerical and categorical variables, which biomedical data typically do. Clustering solutions are based on variational inference with mean field approximation. We apply the proposed method to synthetic and real data, and show that our method outperforms other multiple clustering methods both in recovering true cluster structures and in computation time. Finally, we apply our method to a depression dataset with no true cluster structure available, from which useful inferences are drawn about possible clustering structures of the data.


A Bayesian alternative to mutual information for the hierarchical clustering of dependent random variables

arXiv.org Machine Learning

The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives (in exact and asymptotic forms) to mutual information on simulated and real data. An encouraging result was first derived on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, it identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.


Clustering Noisy Signals with Structured Sparsity Using Time-Frequency Representation

arXiv.org Machine Learning

Clustering of high-dimensional signals, sequences or functional data is a common task that arises in many domains [18, 19]. Such data come up in diverse fields, as in speech analysis, genomics, mass spectrometry, MRI or EEG measurements, and many more. Clustering seeks to partition data into groups with high overall similarity between members (instances) of the same group and dissimilarity to members of other groups. For time-series signals, this means partitioning the instances into groups of similarly behaving functions over time, where the measure of similarity is crucial and often application-specific. In many real-world scenarios, signals are high-dimensional (such as in genomics), noisy (as in low-quality speech recordings), and exhibit non-stationary behavior: for example peaks and other non-smooth local patterns, or changes in frequency over time.