Statistical Learning
Bayesian Dropout
Herlau, Tue, Mørup, Morten, Schmidt, Mikkel N.
Dropout has recently emerged as a powerful and simple method for training neural networks preventing co-adaptation by stochastically omitting neurons. Dropout is currently not grounded in explicit modelling assumptions which so far has precluded its adoption in Bayesian modelling. Using Bayesian entropic reasoning we show that dropout can be interpreted as optimal inference under constraints. We demonstrate this on an analytically tractable regression model providing a Bayesian interpretation of its mechanism for regularizing and preventing co-adaptation as well as its connection to other Bayesian techniques. We also discuss two general approximate techniques for applying Bayesian dropout for general models, one based on an analytical approximation and the other on stochastic variational techniques. These techniques are then applied to a Baysian logistic regression problem and are shown to improve performance as the model become more misspecified. Our framework roots dropout as a theoretically justified and practical tool for statistical modelling allowing Bayesians to tap into the benefits of dropout training.
Identifying manifolds underlying group motion in Vicsek agents
Gajamannage, Kelum, Butail, Sachit, Porfiri, Maurizio, Bollt, Erik M.
In a topological sense, we describe these changes as switching between low-dimensional embedding manifolds underlying a group of evolving agents. To characterize such manifolds, first we introduce a simple mapping of agents between time-steps. Then, we construct a novel metric which is susceptible to variations in the collective motion, thus revealing distinct underlying manifolds. The method is validated through three sample scenarios simulated using a Vicsek model, namely switching of speed, coordination, and structure of a group. Combined with a dimensionality reduction technique that is used to infer the dimensionality of the embedding manifold, this approach provides an effective model-free framework for the analysis of collective behavior across animal species. In animal groups, the response to a perturbation--internal or external--is often manifested in the form of changes in group speed, coordination, or structure [3,5,11,16,27]. Such changes are witnessed in fish schools and bird flocks under attack [15,17,22], foraging animal groups [4, 8], and human crowds exposed to alarm situations leading to panic [12, 19]. Based on our recent effort demonstrating that collective motion is associated with a low-dimensional embedding [1, 2, 6, 7, 10], we expect that such behavioral changes should be manifested in variation of the topology of an underlying manifold.
Communication-efficient sparse regression: a one-shot approach
Lee, Jason D., Sun, Yuekai, Liu, Qiang, Taylor, Jonathan E.
Explosive growth in the size of modern datasets has fueled interest in distributed statistical learning. For examples, we refer to Boyd et al. (2011); Dekel et al. (2012); Duchi, Agarwal and Wainwright (2012); Zhang, Duchi and Wainwright (2013) and the references therein. The problem arises, for example, when working with datasets that are too large to fit on a single machine and must be distributed across multiple machines. The main bottleneck in the distributed setting is usually communication between machines/processors, so the overarching goal of algorithm design is to minimize communication costs.
Optimal estimates for short horizon travel time prediction in urban areas
Zliobaite, Indre, Khokhlov, Mikhail
Increasing popularity of mobile route planning applications based on GPS technology provides opportunities for collecting traffic data in urban environments. One of the main challenges for travel time estimation and prediction in such a setting is how to aggregate data from vehicles that have followed different routes, and predict travel time for other routes of interest. One approach is to predict travel times for route segments, and sum those estimates to obtain a prediction for the whole route. We study how to obtain optimal predictions in this scenario. It appears that the optimal estimate, minimizing the expected mean absolute error, is a combination of the mean and the median travel times on each segment, where the combination function depends on the number of segments in the route of interest. We present a methodology for obtaining such predictions, and demonstrate its effectiveness with a case study using travel time data from a district of St. Petersburg collected over one year. The proposed methodology can be applied for real-time prediction of expected travel times in an urban road network.
Model-based SIR for dimension reduction
A new dimension reduction method based on Gaussian finite mixtures is proposed as an extension to sliced inverse regression (SIR). The model-based SIR (MSIR) approach allows the main limitation of SIR to be overcome, i.e., failure in the presence of regression symmetric relationships, without the need to impose further assumptions. Extensive numerical studies are presented to compare the new method with some of most popular dimension reduction methods, such as SIR, sliced average variance estimation, principal Hessian direction, and directional regression. MSIR appears sufficiently flexible to accommodate various regression functions, and its performance is comparable with or better, particularly as sample size grows, than other available methods. Lastly, MSIR is illustrated with two real data examples about ozone concentration regression, and hand-written digit classification.
Kernel Methods for Linear Discrete-Time Equations
Colonius, Fritz, Hamzi, Boumediene
This paper discusses several problems in dynamical systems and control, where methods from learning theory are used in the state space of linear systems. This is in contrast to previous approaches in the frequency domain [19, 6]. We refer to [6] for a general survey on applications of machine learning to system identification. Basically, learning theory allows to deal with problems when only data from a given system are given. Reproducing Kernel Hilbert Spaces (RKHS) allow to work in a very large dimensional space in order to simplify the underlying problem.
Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels
Avron, Haim, Sindhwani, Vikas, Yang, Jiyan, Mahoney, Michael
We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large datasets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.
A variational approach to the consistency of spectral clustering
Trillos, Nicolás García, Slepčev, Dejan
This paper establishes the consistency of spectral approaches to data clustering. We consider clustering of point clouds obtained as samples of a ground-truth measure. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points they connect. We investigate the spectral convergence of both unnormalized and normalized graph Laplacians towards the appropriate operators in the continuum domain. We obtain sharp conditions on how the connectivity radius can be scaled with respect to the number of sample points for the spectral convergence to hold. We also show that the discrete clusters obtained via spectral clustering converge towards a continuum partition of the ground truth measure. Such continuum partition minimizes a functional describing the continuum analogue of the graph-based spectral partitioning. Our approach, based on variational convergence, is general and flexible.
Consistency of random forests
Scornet, Erwan, Biau, Gérard, Vert, Jean-Philippe
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5--32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty to simultaneously analyze both the randomization process and the highly data-dependent tree structure. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's [Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity. 1. Introduction. Random forests are an ensemble learning method for classification and regression that constructs a number of randomized decision trees during the training phase and predicts by averaging the results. Since its publication in the seminal paper of Breiman (2001), the procedure has become a major data analysis tool, that performs well in practice in comparison with many standard methods. What has greatly contributed to the popularity of forests is the fact that they can be applied to a wide range of prediction problems and have few parameters to tune. Aside from being simple to use, the method is generally recognized for its accuracy and its ability to deal with small sample sizes, high-dimensional feature spaces and complex data structures. The random forest methodology has been successfully involved in many practical problems, including air quality prediction (winning code of the EMC data science global hackathon in 2012, see http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al. (2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3D
Extensions of stability selection using subsamples of observations and covariates
Beinrucker, Andre, Dogan, Ürün, Blanchard, Gilles
Variable selection techniques aim at identifying such relevant covariates (for a review see Guyon, 2006). Usually, variable selection aims at one of two goals: to identify informative covariates in order to get scientific insight into the data and the process that generated the outcome; or to use the covariates identified as relevant in order to predict the outcome. In this work we primarily focus on the identification of informative covariates but also consider prediction results using real data. We consider variable selection (also called feature selection in computer science-related communities) as a part of the broader field of dimensionality reduction. Many variable selection methods share the common drawback of being unstable with respect to small changes of the data: if one estimates the set of relevant covariates on different sets of observations coming from the same source, the result can vary significantly.