Algorithms for Infinitely Many-Armed Bandits
Wang, Yizao, Audibert, Jean-yves, Munos, Rémi
We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a newly selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases.
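As a point of reference only, the sketch below illustrates the general strategy described above, i.e. a UCB rule applied to a restricted set of $K$ randomly selected arms; the Bernoulli rewards and the uniform distribution of arm means are illustrative assumptions, not the paper's reward model or its exact algorithm.

    import math
    import random

    def new_arm():
        # Illustrative assumption: the mean reward of a freshly drawn arm
        # is uniform on [0, 1]; the paper's stochastic assumption is weaker.
        return random.random()

    def ucb_on_random_subset(K, horizon):
        means = [new_arm() for _ in range(K)]   # restricted set of randomly selected arms
        counts = [0] * K
        sums = [0.0] * K
        for t in range(1, horizon + 1):
            if t <= K:
                k = t - 1                       # initialization: play each arm once
            else:
                # UCB-1 index: empirical mean plus exploration bonus
                k = max(range(K), key=lambda i: sums[i] / counts[i]
                        + math.sqrt(2.0 * math.log(t) / counts[i]))
            reward = 1.0 if random.random() < means[k] else 0.0   # Bernoulli reward
            counts[k] += 1
            sums[k] += reward
        return sums, counts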
Multi-label Multiple Kernel Learning
Ji, Shuiwang, Sun, Liang, Jin, Rong, Ye, Jieping
We present a multi-label multiple kernel learning (MKL) formulation, in which the data are embedded into a low-dimensional space directed by the instance-label correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, and it can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained and convex optimization problem. In addition, we show that the objective function of the approximate formulation is continuously differentiable with Lipschitz gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results are reported in comparison with representative algorithms.
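The MKL building block assumed in the abstract is a kernel matrix expressed as a convex combination of given base kernels; the sketch below shows only that combination step, with placeholder weights, since the min-max / SILP learning of the weights is the paper's contribution and is not reproduced here.

    import numpy as np

    def combine_kernels(base_kernels, theta):
        # base_kernels: list of precomputed n x n kernel matrices K_1, ..., K_m
        # theta: nonnegative weights summing to one (supplied by hand here;
        # the paper learns them jointly with the multi-label predictor)
        theta = np.asarray(theta, dtype=float)
        assert np.all(theta >= 0) and np.isclose(theta.sum(), 1.0)
        return sum(t * K for t, K in zip(theta, base_kernels))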
One sketch for all: Theory and Application of Conditional Random Sampling
Li, Ping, Church, Kenneth W., Hastie, Trevor J.
Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise ($l_2$, $l_1$) distances in static, large-scale, and sparse data sets such as text and Web data. It was previously presented using a heuristic argument. This study extends CRS to handle dynamic or streaming data, which much better reflects the real-world situation than assuming static data. Compared with other known sketching algorithms for dimension reduction such as stable random projections, CRS exhibits a significant advantage in that it is ``one-sketch-for-all.'' In particular, we demonstrate that CRS can be applied to efficiently compute the $l_p$ distance and the Hilbertian metrics, both of which are popular in machine learning. Although a fully rigorous analysis of CRS is difficult, we prove that, with a simple modification, CRS is rigorous at least for an important application of computing Hamming norms. A generic estimator and an approximate variance formula are provided and tested on various applications, for computing Hamming norms, Hamming distances, and $\chi^2$ distances.
Differentiable Sparse Coding
Bagnell, J. A., Bradley, David M.
We show how smoother priors can preserve the benefits of sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate, making it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance.
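For readers unfamiliar with the technique, the implicit differentiation referred to above is the standard implicit-function-theorem identity (a textbook statement, not the paper's specific derivation for the KL-regularized model): if $w^*(\theta) = \arg\min_w f(w, \theta)$ for a smooth, strictly convex $f$, then $\nabla_w f(w^*(\theta), \theta) = 0$, and differentiating this optimality condition gives

$$ \frac{\partial w^*}{\partial \theta} = -\left( \nabla_w^2 f(w^*, \theta) \right)^{-1} \nabla_\theta \nabla_w f(w^*, \theta), $$

so the MAP estimate can be differentiated with respect to model parameters without unrolling the optimizer.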
Robust Regression and Lasso
Xu, Huan, Caramanis, Constantine, Mannor, Shie
We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to $\ell_1$ regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.
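Schematically, and with notation assumed here for illustration (the exact uncertainty set and constants are as specified in the paper), the equivalence referred to above takes the form: with feature-wise disturbances $\Delta = [\delta_1, \dots, \delta_m]$ bounded column-wise,

$$ \min_{\beta} \; \max_{\|\delta_j\|_2 \le c,\; j=1,\dots,m} \; \| y - (X + \Delta)\beta \|_2 \;=\; \min_{\beta} \; \| y - X\beta \|_2 + c \|\beta\|_1, $$

i.e. the robust regression problem with this particular uncertainty set coincides with an $\ell_1$-regularized regression.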
Efficient Recovery of Jointly Sparse Vectors
Sun, Liang, Liu, Jun, Chen, Jianhui, Ye, Jieping
We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on the $(2,1)$-norm minimization, which is an extension of the well-known $1$-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone program (SOCP) or a semidefinite program (SDP), which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals the close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm.
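In standard form (with notation assumed here rather than taken from the paper), the convex relaxation discussed above is

$$ \min_{X \in \mathbb{R}^{n \times k}} \; \|X\|_{2,1} := \sum_{i=1}^{n} \|x^i\|_2 \quad \text{subject to} \quad A X = B, $$

where $x^i$ denotes the $i$-th row of $X$; for $k = 1$ this reduces to the familiar $1$-norm minimization of the SMV model.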
Human Rademacher Complexity
Zhu, Jerry, Gibson, Bryan R., Rogers, Timothy T.
We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner's ability to fit random data, and can be used to bound the learner's true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a ``learning the noise'' procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, and (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science.
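For reference, the (empirical) Rademacher complexity reviewed in the paper is, in one standard convention,

$$ \hat{R}_n(F) = \mathbb{E}_{\sigma} \left[ \sup_{f \in F} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right], $$

where the $\sigma_i$ are independent Rademacher variables uniform on $\{-1, +1\}$ (constant factors vary across texts); the paper's ``learning the noise'' procedure estimates the analogous quantity experimentally with human learners.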
Sensitivity analysis in HMMs with application to likelihood maximization
Coquelin, Pierre-arnaud, Deguest, Romain, Munos, Rémi
This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) of the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased and consistent, with computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.
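As context for the kind of filtering algorithm the analysis applies to, the sketch below is a generic bootstrap particle filter that estimates the log-likelihood of an observation sequence; the model hooks (init_sample, transition_sample, obs_likelihood) are hypothetical placeholders, and the paper's IPA gradient estimator, which is the actual contribution, is not reproduced here.

    import numpy as np

    def particle_log_likelihood(observations, n_particles,
                                init_sample, transition_sample, obs_likelihood):
        # init_sample(n) -> numpy array of n initial particles (hypothetical hook)
        # transition_sample(particles) -> propagated particles (hypothetical hook)
        # obs_likelihood(y, particles) -> array of observation densities (hypothetical hook)
        particles = init_sample(n_particles)
        log_lik = 0.0
        for y in observations:
            particles = transition_sample(particles)        # propagate through the dynamics
            weights = obs_likelihood(y, particles)           # weight by the observation density
            log_lik += np.log(np.mean(weights))              # incremental likelihood estimate
            idx = np.random.choice(n_particles, size=n_particles,
                                   p=weights / weights.sum())
            particles = particles[idx]                       # multinomial resampling
        return log_lik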
A Neural Implementation of the Kalman Filter
There is a growing body of experimental evidence to suggest that the brain is capable of approximating optimal Bayesian inference in the face of noisy input stimuli. Despite this progress, the neural underpinnings of this computation are still poorly understood. In this paper we focus on the problem of Bayesian filtering of stochastic time series. In particular we introduce a novel neural network, derived from a line attractor architecture, whose dynamics map directly onto those of the Kalman Filter in the limit where the prediction error is small. When the prediction error is large we show that the network responds robustly to change-points in a way that is qualitatively compatible with the optimal Bayesian model. The model suggests ways in which probability distributions are encoded in the brain and makes a number of testable experimental predictions.
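For reference, the scalar Kalman filter recursion whose dynamics the network is said to reproduce in the small-prediction-error limit is, in standard form (state transition $a$, process noise variance $q$, observation noise variance $r$):

$$ \hat{x}_{t|t-1} = a\,\hat{x}_{t-1}, \qquad p_{t|t-1} = a^2 p_{t-1} + q, \qquad k_t = \frac{p_{t|t-1}}{p_{t|t-1} + r}, $$
$$ \hat{x}_t = \hat{x}_{t|t-1} + k_t \left( y_t - \hat{x}_{t|t-1} \right), \qquad p_t = (1 - k_t)\, p_{t|t-1}. $$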
Boosting with Spatial Regularization
Xi, Yongxin, Hasson, Uri, Ramadge, Peter J., Xiang, Zhen J.
By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a ``grouping effect,'' which encourages the selection of all spatially local, discriminative base classifiers. The algorithm's primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g., the voxel selection problem in fMRI. We demonstrate the algorithm's performance on various data sets.