Goto

Collaborating Authors

 Clustering


Graph Construction for Learning with Unbalanced Data

arXiv.org Machine Learning

Unbalanced data arises in many learning tasks such as clustering of multi-class data, hierarchical divisive clustering and semisupervised learning. Graph-based approaches are popular tools for these problems. Graph construction is an important aspect of graph-based learning. We show that graph-based algorithms can fail for unbalanced data for many popular graphs such as k-NN, \epsilon-neighborhood and full-RBF graphs. We propose a novel graph construction technique that encodes global statistical information into node degrees through a ranking scheme. The rank of a data sample is an estimate of its p-value and is proportional to the total number of data samples with smaller density. This ranking scheme serves as a surrogate for density; can be reliably estimated; and indicates whether a data sample is close to valleys/modes. This rank-modulated degree(RMD) scheme is able to significantly sparsify the graph near valleys and provides an adaptive way to cope with unbalanced data. We then theoretically justify our method through limit cut analysis. Unsupervised and semi-supervised experiments on synthetic and real data sets demonstrate the superiority of our method.


Multi-scale Mining of fMRI data with Hierarchical Structured Sparsity

arXiv.org Machine Learning

Inverse inference, or "brain reading", is a recent paradigm for analyzing functional magnetic resonance imaging (fMRI) data, based on pattern recognition and statistical learning. By predicting some cognitive variables related to brain activation maps, this approach aims at decoding brain activity. Inverse inference takes into account the multivariate information between voxels and is currently the only way to assess how precisely some cognitive information is encoded by the activity of neural populations within the whole brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, since there are far more features than samples, i.e., more voxels than fMRI volumes. To address this problem, different methods have been proposed, such as, among others, univariate feature selection, feature agglomeration and regularization techniques. In this paper, we consider a sparse hierarchical structured regularization. Specifically, the penalization we use is constructed from a tree that is obtained by spatially-constrained agglomerative clustering. This approach encodes the spatial structure of the data at different scales into the regularization, which makes the overall prediction procedure more robust to inter-subject variability. The regularization used induces the selection of spatially coherent predictive brain regions simultaneously at different scales. We test our algorithm on real data acquired to study the mental representation of objects, and we show that the proposed algorithm not only delineates meaningful brain regions but yields as well better prediction accuracy than reference methods.


Information-Maximization Clustering based on Squared-Loss Mutual Information

arXiv.org Machine Learning

Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.


Spectral clustering based on local linear approximations

arXiv.org Machine Learning

In the context of clustering, we assume a generative model where each cluster is the result of sampling points in the neighborhood of an embedded smooth surface; the sample may be contaminated with outliers, which are modeled as points sampled in space away from the clusters. We consider a prototype for a higher-order spectral clustering method based on the residual from a local linear approximation. We obtain theoretical guarantees for this algorithm and show that, in terms of both separation and robustness to outliers, it outperforms the standard spectral clustering algorithm (based on pairwise distances) of Ng, Jordan and Weiss (NIPS '01). The optimal choice for some of the tuning parameters depends on the dimension and thickness of the clusters. We provide estimators that come close enough for our theoretical purposes. We also discuss the cases of clusters of mixed dimensions and of clusters that are generated from smoother surfaces. In our experiments, this algorithm is shown to outperform pairwise spectral clustering on both simulated and real data.


Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement

arXiv.org Machine Learning

In areas such as search, matching, retrieval and general data analysis, massive increase in data requires new methods that can cope well with the explosion in volume and dimensionality of the available data. In this work, the Baire metric, which is furthermore an ultrametric, is used to induce a hierarchy and in turn to support clustering, matching and other operations. Arising directly out of the Baire distance is an ultrametric tree, which also can be seen as a tree that hierarchically clusters data. This presents a number of advantages when storing and retrieving data. When the data source is in numerical form this ultrametric tree can be used as an index structure making matching and search, and thus retrieval, much easier. The clusters can be associated with hash keys, that is to say, the cluster members can be mapped onto "bins" or "buckets".


A fast and recursive algorithm for clustering large datasets with $k$-medians

arXiv.org Machine Learning

Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967) who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions, which are known to have better performances, and a data-driven procedure that allows automatic selection of the value of the descent step is proposed. The performance of the averaged sequential estimator is compared on a simulation study, both in terms of computation speed and accuracy of the estimations, with more classical partitioning techniques such as $k$-means, trimmed $k$-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on determining television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.


A Nonparametric Frequency Domain EM Algorithm for Time Series Classification with Applications to Spike Sorting and Macro-Economics

arXiv.org Machine Learning

I propose a frequency domain adaptation of the Expectation Maximization (EM) algorithm to group a family of time series in classes of similar dynamic structure. It does this by viewing the magnitude of the discrete Fourier transform (DFT) of each signal (or power spectrum) as a probability density/mass function (pdf/pmf) on the unit circle: signals with similar dynamics have similar pdfs; distinct patterns have distinct pdfs. An advantage of this approach is that it does not rely on any parametric form of the dynamic structure, but can be used for non-parametric, robust and model-free classification. This new method works for non-stationary signals of similar shape as well as stationary signals with similar auto-correlation structure. Applications to neural spike sorting (non-stationary) and pattern-recognition in socio-economic time series (stationary) demonstrate the usefulness and wide applicability of the proposed method.


Modern hierarchical, agglomerative clustering algorithms

arXiv.org Machine Learning

This paper presents algorithms for hierarchical, agglomerative clustering which perform most efficiently in the general-purpose setup that is given in modern standard software. Requirements are: (1) the input data is given by pairwise dissimilarities between data points, but extensions to vector data are also discussed (2) the output is a "stepwise dendrogram", a data structure which is shared by all implementations in current standard software. We present algorithms (old and new) which perform clustering in this setting efficiently, both in an asymptotic worst-case analysis and from a practical point of view. The main contributions of this paper are: (1) We present a new algorithm which is suitable for any distance update scheme and performs significantly better than the existing algorithms. (2) We prove the correctness of two algorithms by Rohlf and Murtagh, which is necessary in each case for different reasons. (3) We give well-founded recommendations for the best current algorithms for the various agglomerative clustering schemes.


Characterization and exploitation of community structure in cover song networks

arXiv.org Machine Learning

The use of community detection algorithms is explored within the framework of cover song identification, i.e. the automatic detection of different audio renditions of the same underlying musical piece. Until now, this task has been posed as a typical query-by-example task, where one submits a query song and the system retrieves a list of possible matches ranked by their similarity to the query. In this work, we propose a new approach which uses song communities to provide more relevant answers to a given query. Starting from the output of a state-of-the-art system, songs are embedded in a complex weighted network whose links represent similarity (related musical content). Communities inside the network are then recognized as groups of covers and this information is used to enhance the results of the system. In particular, we show that this approach increases both the coherence and the accuracy of the system. Furthermore, we provide insight into the internal organization of individual cover song communities, showing that there is a tendency for the original song to be central within the community. We postulate that the methods and results presented here could be relevant to other query-by-example tasks.


Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis

arXiv.org Artificial Intelligence

We present a novel approach to the automatic acquisition of taxonomies or concept hierarchies from a text corpus. The approach is based on Formal Concept Analysis (FCA), a method mainly used for the analysis of data, i.e. for investigating and processing explicitly given information. We follow Harris' distributional hypothesis and model the context of a certain term as a vector representing syntactic dependencies which are automatically acquired from the text corpus with a linguistic parser. On the basis of this context information, FCA produces a lattice that we convert into a special kind of partial order constituting a concept hierarchy. The approach is evaluated by comparing the resulting concept hierarchies with handcrafted taxonomies for two domains: tourism and finance. We also directly compare our approach with hierarchical agglomerative clustering as well as with Bi-Section-KMeans as an instance of a divisive clustering algorithm. Furthermore, we investigate the impact of using different measures weighting the contribution of each attribute as well as of applying a particular smoothing technique to cope with data sparseness.