output kernel
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The submission describes a convex deep learning formulation that leverages a number of key ideas. First, a training objective is proposed that explicitly includes the outputs of hidden layers as variables to be inferred via optimization. These are linked to linear responses via a loss function, and the net objective is the sum of these loss functions across the layers, plus some regularization terms. Next, a number of changes of variables are performed in order to reparameterize the objective into a convex form, heavily leveraging the representer theorem and the idea of value regularization. We are left with a convex objective in terms of three different matrices (per layer) to optimize. In particular, one of these matrices is a nonparametric'normalized output kernel' matrix, which takes the place of optimizing over the hidden layer outputs directly; however, this leads to a transductive method where we must simultaneously solve the optimization for training and test inputs.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
More specifically, it deals with learning the output kernel which defined over multiple tasks. Starting from the vector-valued RKHS formulation of the multi-task learning problem, the authors propose an output kernel learning algorithm that are able to learn both the multi-task learning function and task dependencies. The authors first provide an optimization formulation of the problem which is jointly convex. Then they show that this optimization can be solved without the positive-semidefiniteness constraint of the output kernel, which is computationally costly, and propose the use of a stochastic dual coordinate ascent method to solve it. Experiments on multi-task data sets and comparison in terms of performance and running time with previous multi-task learning algorithms are provided. The paper builds upon recent studies on multitask learning and output kernel learning.
Efficient Output Kernel Learning for Multiple Tasks
The paradigm of multi-task learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other. While previously the relationship between tasks had to be user-defined in the form of an output kernel, recent approaches jointly learn the tasks and the output kernel. As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.
Why bigger is not always better: on finite and infinite neural networks
Recent work has shown that the outputs of convolutional neural networks become Gaussian process (GP) distributed when we take the number of channels to infinity. In principle, these infinite networks should perform very well, both because they allow for exact Bayesian inference, and because widening networks is generally thought to improve (or at least not diminish) performance. However, Bayesian infinite networks perform poorly in comparison to finite networks, and our goal here is to explain this discrepancy. We note that the high-level representation induced by an infinite network has very little flexibility; it depends only on network hyperparameters such as depth, and as such cannot learn a good high-level representation of data. In contrast, finite networks correspond to a rich prior over high-level representations, corresponding to kernel hyperparameters. We analyse this flexibility from the perspective of the prior (looking at the structured prior covariance of the top-level kernel), and from the perspective of the posterior, showing that the representation in a learned, finite deep linear network slowly transitions from the kernel induced by the inputs towards the kernel induced by the outputs, both for gradient descent, and for Langevin sampling. Finally, we explore representation learning in deep, convolutional, nonlinear networks, showing that learned representations differ dramatically from the corresponding infinite network. One approach to understanding and improving neural networks is to perform Bayesian inference in an infinitely wide network, which can be done both for fully connected (Lee et al., 2018; Matthews et al., 2018) and convolutional networks (Garriga-Alonso et al., 2019; Novak et al., 2019).
Do place cells dream of conditional probabilities? Learning Neural Nystr\"om representations
We posit that hippocampal place cells encode information about future locations under a transition distribution observed as an agent explores a given (physical or conceptual) space. The encoding of information about the current location, usually associated with place cells, then emerges as a necessary step to achieve this broader goal. We formally derive a biologically-inspired neural network from Nystr\"om kernel approximations and empirically demonstrate that the network successfully approximates transition distributions. The proposed network yields representations that, just like place cells, soft-tile the input space with highly sparse and localized receptive fields. Additionally, we show that the proposed computational motif can be extended to handle supervised problems, creating class-specific place cells while exhibiting low sample complexity.
Forecasting and Granger Modelling with Non-linear Dynamical Dependencies
Gregorovรก, Magda, Kalousis, Alexandros, Marchand-Maillet, Stรฉphane
Traditional linear methods for forecasting multivariate time series are not able to satisfactorily model the non-linear dependencies that may exist in non-Gaussian series. We build on the theory of learning vector-valued functions in the reproducing kernel Hilbert space and develop a method for learning prediction functions that accommodate such non-linearities. The method not only learns the predictive function but also the matrix-valued kernel underlying the function search space directly from the data. Our approach is based on learning multiple matrix-valued kernels, each of those composed of a set of input kernels and a set of output kernels learned in the cone of positive semi-definite matrices. In addition to superior predictive performance in the presence of strong non-linearities, our method also recovers the hidden dynamic relationships between the series and thus is a new alternative to existing graphical Granger techniques.
Efficient Output Kernel Learning for Multiple Tasks
Jawanpuria, Pratik Kumar, Lapin, Maksim, Hein, Matthias, Schiele, Bernt
The paradigm of multi-task learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other. While previously the relationship between tasks had to be user-defined in the form of an output kernel, recent approaches jointly learn the tasks and the output kernel. As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.
Efficient Output Kernel Learning for Multiple Tasks
Jawanpuria, Pratik, Lapin, Maksim, Hein, Matthias, Schiele, Bernt
The paradigm of multi-task learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other. While previously the relationship between tasks had to be user-defined in the form of an output kernel, recent approaches jointly learn the tasks and the output kernel. As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.
Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality
Sindhwani, Vikas, Quang, Minh Ha, Lozano, Aurelie C.
We propose a general matrix-valued multiple kernel learning framework for high-dimensional nonlinear multivariate regression problems. This framework allows a broad class of mixed norm regularizers, including those that induce sparsity, to be imposed on a dictionary of vector-valued Reproducing Kernel Hilbert Spaces. We develop a highly scalable and eigendecomposition-free algorithm that orchestrates two inexact solvers for simultaneously learning both the input and output components of separable matrix-valued kernels. As a key application enabled by our framework, we show how high-dimensional causal inference tasks can be naturally cast as sparse function estimation problems, leading to novel nonlinear extensions of a class of Graphical Granger Causality techniques. Our algorithmic developments and extensive empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds.