Statistical Learning
Modeling the spacing effect in sequential category learning
Lu, Hongjing, Weiden, Matthew, Yuille, Alan L.
We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently performmodel selection to decide whether observations are from one or multiple categories.
Probabilistic Relational PCA
Li, Wu-jun, Yeung, Dit-Yan, Zhang, Zhihua
One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for relational data. In this paper, by explicitly modeling covariance between instances as derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised easily like those for PPCA which makes explicit use of the i.i.d. assumption. Experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.
Inter-domain Gaussian Processes for Sparse Inference using Inducing Features
Lรกzaro-Gredilla, Miguel, Figueiras-Vidal, Anรญbal
We present a general inference framework for inter-domain Gaussian Processes (GPs), focusing on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n^3) to O(m^2n), where m << n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency.
Ensemble Nystrom Method
Kumar, Sanjiv, Mohri, Mehryar, Talwalkar, Ameet
A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nystrom approximations, ensemble Nystrom algorithms, that yield more accurate low-rank approximations than the standard Nystrom method. We give a detailed study of multiple variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nystrom method. Finally, we report the results of extensive experiments with several data sets containing up to 1M points demonstrating the signi๏ฌcant performance improvements gained over the standard Nystrom approximation.
Fast, smooth and adaptive regression in metric spaces
It was recently shown that certain nonparametric regressors can escape the curse of dimensionality in the sense that their convergence rates adapt to the intrinsic dimension of data (\cite{BL:65, SK:77}). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, operates on a general metric space, yields a smooth function, and evaluates in time $O(\log n)$. We derive a tight convergence rate of the form $n^{-2/(2+d)}$ where $d$ is the Assouad dimension of the input space.
Sparsistent Learning of Varying-coefficient Models with Structural Changes
Kolar, Mladen, Song, Le, Xing, Eric P.
To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model --- piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models.
Efficient and Accurate Lp-Norm Multiple Kernel Learning
Kloft, Marius, Brefeld, Ulf, Laskov, Pavel, Mรผller, Klaus-Robert, Zien, Alexander, Sonnenburg, Sรถren
Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations and hence support interpretability. Unfortunately, L1-norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary Lp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p>1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply Lp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.
Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks
Klampfl, Stefan, Maass, Wolfgang
Many models for computations in recurrent networks of neurons assume that the network state moves from some initial state to some fixed point attractor or limit cycle that represents the output of the computation. However experimental data show that in response to a sensory stimulus the network state moves from its initial state through a trajectory of network states and eventually returns to the initial state, without reaching an attractor or limit cycle in between. This type of network response, where salient information about external stimuli is encoded in characteristic trajectories of continuously varying network states, raises the question how a neural system could compute with such code, and arrive for example at a temporally stable classification of the external stimulus. We show that a known unsupervised learning algorithm, Slow Feature Analysis (SFA), could be an important ingredient for extracting stable information from these network trajectories. In fact, if sensory stimuli are more often followed by another stimulus from the same class than by a stimulus from another class, SFA approaches the classification capability of Fishers Linear Discriminant (FLD), a powerful algorithm for supervised learning. We apply this principle to simulated cortical microcircuits, and show that it enables readout neurons to learn discrimination of spoken digits and detection of repeating firing patterns within a stream of spike trains with the same firing statistics, without requiring any supervision for learning.
Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
Kim, Kwang I., Steinke, Florian, Hein, Matthias
Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Outgoing from these observations we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both of these problems, in particular, if the data lies on or close to a low-dimensional submanifold in the feature space, the Hessian energy prefers functions which vary ``linearly with respect to the natural parameters in the data. This property makes it also particularly suited for the task of semi-supervised dimensionality reduction where the goal is to find the natural parameters in the data based on a few labeled points. The experimental result suggest that our method is superior to semi-supervised regression using Laplacian regularization and standard supervised methods and is particularly suited for semi-supervised dimensionality reduction.
Directed Regression
Kao, Yi-hao, Roy, Benjamin V., Yan, Xiang
When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. When there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. Empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. We propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. We demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. We also develop a theory that motivates the algorithm.