Goto

Collaborating Authors

 Clustering


A time series distance measure for efficient clustering of input output signals by their underlying dynamics

arXiv.org Machine Learning

Starting from a dataset with input/output time series generated by multiple deterministic linear dynamical systems, this paper tackles the problem of automatically clustering these time series. We propose an extension to the so-called Martin cepstral distance, that allows to efficiently cluster these time series, and apply it to simulated electrical circuits data. Traditionally, two ways of handling the problem are used. The first class of methods employs a distance measure on time series (e.g. Euclidean, Dynamic Time Warping) and a clustering technique (e.g. k-means, k-medoids, hierarchical clustering) to find natural groups in the dataset. It is, however, often not clear whether these distance measures effectively take into account the specific temporal correlations in these time series. The second class of methods uses the input/output data to identify a dynamic system using an identification scheme, and then applies a model norm-based distance (e.g. H2, H-infinity) to find out which systems are similar. This, however, can be very time consuming for large amounts of long time series data. We show that the new distance measure presented in this paper performs as good as when every input/output pair is modelled explicitly, but remains computationally much less complex. The complexity of calculating this distance between two time series of length N is O(N logN).


Spectral Clustering via Graph Filtering: Consistency on the High-Dimensional Stochastic Block Model

arXiv.org Machine Learning

Spectral clustering is amongst the most popular methods for community detection in graphs. A key step in spectral clustering algorithms is the eigen-decomposition of the $n{\times}n$ graph Laplacian matrix to extract its $k$ leading eigenvectors, where $k$ is the desired number of clusters among $n$ objects. This is prohibitively complex to implement for very large datasets. However, it has recently been shown that it is possible to bypass the eigen-decomposition by computing an approximate spectral embedding through graph filtering of random signals. In this paper, we prove that spectral clustering performed via graph filtering can still recover the planted clusters consistently, under mild conditions. We analyse the effects of sparsity, dimensionality and filter approximation error on the consistency of the algorithm.


Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables

arXiv.org Machine Learning

This note is concerned with accurate and computationally efficient approximations of moments of Gaussian random variables passed through sigmoid or softmax mappings. These approximations are semi-analytical (i.e. they involve the numerical adjustment of parametric forms) and highly accurate (they yield 5% error at most). We also highlight a few niche applications of these approximations, which arise in the context of, e.g., drift-diffusion models of decision making or non-parametric data clustering approaches. We provide these as examples of efficient alternatives to more tedious derivations that would be needed if one was to approach the underlying mathematical issues in a more formal way. We hope that this technical note will be helpful to modellers facing similar mathematical issues, although maybe stemming from different academic prospects.


Phylogenetic Tools in Astrophysics

arXiv.org Machine Learning

Multivariate clustering in astrophysics is a recent development justified by the bigger and bigger surveys of the sky. The phylogenetic approach is probably the most unexpected technique that has appeared for the unsupervised classification of galaxies, stellar populations or globular clusters. On one side, this is a somewhat natural way of classifying astrophysical entities which are all evolving objects. On the other side, several conceptual and practical difficulties arize, such as the hierarchical representation of the astrophysical diversity, the continuous nature of the parameters, and the adequation of the result to the usual practice for the physical interpretation. Most of these have now been solved through the studies of limited samples of stellar clusters and galaxies. Up to now, only the Maximum Parsimony (cladistics) has been used since it is the simplest and most general phylogenetic technique. Probabilistic and network approaches are obvious extensions that should be explored in the future.


A description length approach to determining the number of k-means clusters

arXiv.org Machine Learning

We present an asymptotic criterion to determine the optimal number of clusters in k-means. We consider k-means as data compression, and propose to adopt the number of clusters that minimizes the estimated description length after compression. Here we report two types of compression ratio based on two ways to quantify the description length of data after compression. This approach further offers a way to evaluate whether clusters obtained with k-means have a hierarchical structure by examining whether multi-stage compression can further reduce the description length. We applied our criteria to determine the number of clusters to synthetic data and empirical neuroimaging data to observe the behavior of the criteria across different types of data set and suitability of the two types of criteria for different datasets. We found that our method can offer reasonable clustering results that are useful for dimension reduction. While our numerical results revealed dependency of our criteria on the various aspects of dataset such as the dimensionality, the description length approach proposed here provides a useful guidance to determine the number of clusters in a principled manner when underlying properties of the data are unknown and only inferred from observation of data.


Multi-Sensor Data Pattern Recognition for Multi-Target Localization: A Machine Learning Approach

arXiv.org Machine Learning

Conducting surveillance missions using sensor networks is essential for many civilian and military applications, such as disaster response [1], border patrol [2], force protection [3], [4], combat missions [5], and traffic management [6]. One main task in these missions is to collect data regarding the operational environment and then obtain intelligence information from the data. Because the sensors used to collect data are often spatially distributed, extracting data patterns becomes critical to obtain accurate knowledge about the underlying activities. The existing work on identifying data patterns from spatially distributed sensors is focused on developing probabilistic reasoning techniques without recognizing the specific data association or data patterns. Existing approaches for multitarget state estimation can be characterized by two features: a data-to-target assignment algorithm, and an algorithm for single target state estimation under preexisting data-to-target associations. With unknown data association, probabilistic data association (PDA) [7] and multiple hypothesis tracking (MHT) [8] are two common approaches where dense measurements are available. In the study of traffic patterns, the existing research is focused on estimating traffic density and smart routes [6] without analyzing the data pattern to obtain better knowledge of traffic information.


Uniform Deviation Bounds for Unbounded Loss Functions like k-Means

arXiv.org Machine Learning

Uniform deviation bounds limit the difference between a model's expected loss and its loss on an empirical sample uniformly for all models in a learning problem. As such, they are a critical component to empirical risk minimization. In this paper, we provide a novel framework to obtain uniform deviation bounds for loss functions which are *unbounded*. In our main application, this allows us to obtain bounds for $k$-Means clustering under weak assumptions on the underlying distribution. If the fourth moment is bounded, we prove a rate of $\mathcal{O}\left(m^{-\frac12}\right)$ compared to the previously known $\mathcal{O}\left(m^{-\frac14}\right)$ rate. Furthermore, we show that the rate also depends on the kurtosis - the normalized fourth moment which measures the "tailedness" of a distribution. We further provide improved rates under progressively stronger assumptions, namely, bounded higher moments, subgaussianity and bounded support.


Scalable and Distributed Clustering via Lightweight Coresets

arXiv.org Machine Learning

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct light-weight coresets for k-Means clustering, Bregman clustering and maximum likelihood estimation of Gaussian mixture models. The algorithm is substantially faster than existing constructions, embarrassingly parallel and resulting coresets are smaller. In an extensive experimental evaluation, we demonstrate that the proposed method outperforms existing coreset constructions.


The Shape of Data and Probability Measures

arXiv.org Machine Learning

We introduce the notion of multiscale covariance tensor fields (CTF) associated with Euclidean random variables as a gateway to the shape of their distributions. Multiscale CTFs quantify variation of the data about every point in the data landscape at all spatial scales, unlike the usual covariance tensor that only quantifies global variation about the mean. Empirical forms of localized covariance previously have been used in data analysis and visualization, but we develop a framework for the systematic treatment of theoretical questions and computational models based on localized covariance. We prove strong stability theorems with respect to the Wasserstein distance between probability measures, obtain consistency results, as well as estimates for the rate of convergence of empirical CTFs. These results ensure that CTFs are robust to sampling, noise and outliers. We provide numerous illustrations of how CTFs let us extract shape from data and also apply CTFs to manifold clustering, the problem of categorizing data points according to their noisy membership in a collection of possibly intersecting, smooth submanifolds of Euclidean space. We prove that the proposed manifold clustering method is stable and carry out several experiments to validate the method.


On Context-Dependent Clustering of Bandits

arXiv.org Artificial Intelligence

We investigate a novel cluster-of-bandit algorithm CAB for collaborative recommendation tasks that implements the underlying feedback sharing mechanism by estimating the neighborhood of users in a context-dependent manner. CAB makes sharp departures from the state of the art by incorporating collaborative effects into inference as well as learning processes in a manner that seamlessly interleaving explore-exploit tradeoffs and collaborative steps. We prove regret bounds under various assumptions on the data, which exhibit a crisp dependence on the expected number of clusters over the users, a natural measure of the statistical difficulty of the learning task. Experiments on production and real-world datasets show that CAB offers significantly increased prediction performance against a representative pool of state-of-the-art methods.