Goto

Collaborating Authors

 Mathematical & Statistical Methods


A Tutorial on Independent Component Analysis

arXiv.org Machine Learning

Independent component analysis (ICA) has become a standard data analysis technique applied to an array of problems in signal processing and machine learning. This tutorial provides an introduction to ICA based on linear algebra formulating an intuition for ICA from first principles. The goal of this tutorial is to provide a solid foundation on this advanced topic so that one might learn the motivation behind ICA, learn why and when to apply this technique and in the process gain an introduction to this exciting field of active research.


Stochastic processes and feedback-linearisation for online identification and Bayesian adaptive control of fully-actuated mechanical systems

arXiv.org Machine Learning

This work proposes a new method for simultaneous probabilistic identification and control of an observable, fully-actuated mechanical system. Identification is achieved by conditioning stochastic process priors on observations of configurations and noisy estimates of configuration derivatives. In contrast to previous work that has used stochastic processes for identification, we leverage the structural knowledge afforded by Lagrangian mechanics and learn the drift and control input matrix functions of the control-affine system separately. We utilise feedback-linearisation to reduce, in expectation, the uncertain nonlinear control problem to one that is easy to regulate in a desired manner. Thereby, our method combines the flexibility of nonparametric Bayesian learning with epistemological guarantees on the expected closed-loop trajectory. We illustrate our method in the context of torque-actuated pendula where the dynamics are learned with a combination of normal and log-normal processes.


Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

arXiv.org Machine Learning

Contrastive Divergence (CD) and Persistent Contrastive Divergence (PCD) are popular methods for training the weights of Restricted Boltzmann Machines. However, both methods use an approximate method for sampling from the model distribution. As a side effect, these approximations yield significantly different biases and variances for stochastic gradient estimates of individual data points. It is well known that CD yields a biased gradient estimate. In this paper we however show empirically that CD has a lower stochastic gradient estimate variance than exact sampling, while the mean of subsequent PCD estimates has a higher variance than exact sampling. The results give one explanation to the finding that CD can be used with smaller minibatches or higher learning rates than PCD.


An SIR Graph Growth Model for the Epidemics of Communicable Diseases

arXiv.org Machine Learning

It is the main purpose of this paper to introduce a graph-valued stochastic process in order to model the spread of a communicable infectious disease. The major novelty of the SIR model we promote lies in the fact that the social network on which the epidemics is taking place is not specified in advance but evolves through time, accounting for the temporal evolution of the interactions involving infective individuals. Without assuming the existence of a fixed underlying network model, the stochastic process introduced describes, in a flexible and realistic manner, epidemic spread in non-uniformly mixing and possibly heterogeneous populations. It is shown how to fit such a (parametrised) model by means of Approximate Bayesian Computation methods based on graph-valued statistics. The concepts and statistical methods described in this paper are finally applied to a real epidemic dataset, related to the spread of HIV in Cuba in presence of a contact tracing system, which permits one to reconstruct partly the evolution of the graph of sexual partners diagnosed HIV positive between 1986 and 2006.


Distinguishing noise from chaos: objective versus subjective criteria using Horizontal Visibility Graph

arXiv.org Machine Learning

A recently proposed methodology called the Horizontal Visibility Graph (HVG) [Luque {\it et al.}, Phys. Rev. E., 80, 046103 (2009)] that constitutes a geometrical simplification of the well known Visibility Graph algorithm [Lacasa {\it et al.\/}, Proc. Natl. Sci. U.S.A. 105, 4972 (2008)], has been used to study the distinction between deterministic and stochastic components in time series [L. Lacasa and R. Toral, Phys. Rev. E., 82, 036120 (2010)]. Specifically, the authors propose that the node degree distribution of these processes follows an exponential functional of the form $P(\kappa)\sim \exp(-\lambda~\kappa)$, in which $\kappa$ is the node degree and $\lambda$ is a positive parameter able to distinguish between deterministic (chaotic) and stochastic (uncorrelated and correlated) dynamics. In this work, we investigate the characteristics of the node degree distributions constructed by using HVG, for time series corresponding to $28$ chaotic maps and $3$ different stochastic processes. We thoroughly study the methodology proposed by Lacasa and Toral finding several cases for which their hypothesis is not valid. We propose a methodology that uses the HVG together with Information Theory quantifiers. An extensive and careful analysis of the node degree distributions obtained by applying HVG allow us to conclude that the Fisher-Shannon information plane is a remarkable tool able to graphically represent the different nature, deterministic or stochastic, of the systems under study.


Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Neural Information Processing Systems

Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellentperformance in retrieval tasks and intuitive formulation, their computation involvesthe resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms' dimensionexceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective.We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance whichcan be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.


BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

Neural Information Processing Systems

The l1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BigQUIC, which can solve 1 million dimensional l1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BigQUIC can achieve super-linear or even quadratic convergence rates.


An Approximate, Efficient LP Solver for LP Rounding

Neural Information Processing Systems

Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one.These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation ofthe LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude fasterthan Cplex (a commercial LP solver) while producing solutions of similar quality.


Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

Neural Information Processing Systems

Given a convergent sequence of graphs, there exists a limit object called the graphon from which random graphs are generated. This nonparametric perspective of random graphs opens the door to study graphs beyond the traditional parametric models, but at the same time also poses the challenging question of how to estimate the graphon underlying observed graphs. In this paper, we propose a computationally efficient algorithm to estimate a graphon from a set of observed graphs generated from it. We show that, by approximating the graphon with stochastic block models, the graphon can be consistently estimated, that is, the estimation error vanishes as the size of the graph approaches infinity.


Large Scale Distributed Sparse Precision Estimation

Neural Information Processing Systems

We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in column-blocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores.