Multiple kernel learning (MKL), structured sparsity, and multi-task learning have recently received considerable attention. In this paper, we show how different MKL algorithms can be understood as applications of either regularization on the kernel weights or block-norm-based regularization, which is more common in structured sparsity and multi-task learning. We show that these two regularization strategies can be systematically mapped to each other through a concave conjugate operation. When the kernel-weight-based regularizer is separable into components, we can naturally consider a generative probabilistic model behind MKL. Based on this model, we propose learning algorithms for the kernel weights through the maximization of marginal likelihood. We show through numerical experiments that $\ell_2$-norm MKL and Elastic-net MKL achieve comparable accuracy to uniform kernel combination. Although uniform kernel combination might be preferable from its simplicity, $\ell_2$-norm MKL and Elastic-net MKL can learn the usefulness of the information sources represented as kernels. In particular, Elastic-net MKL achieves sparsity in the kernel weights.
In this paper we focus on training deep neural networks for visual recognition tasks. One challenge is the lack of an informative regularization on the network parameters, to imply a meaningful control on the computed function. We propose a training strategy that takes advantage of kernel methods, where an existing kernel function represents useful prior knowledge about the learning task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate very positive results in a wide range of visual recognition tasks.
The theory of RKHS provides an elegant framework for supervised learning. It is the foundation of all kernel methods in machine learning. Implicit in its formulation is the use of a quadratic regularizer associated with the underlying inner product which imposes smoothness constraints. In this paper, we consider instead the generalized total-variation (gTV) norm as the sparsity-promoting regularizer. This leads us to propose a new Banach-space framework that justifies the use of generalized LASSO, albeit in a slightly modified version. We prove a representer theorem for multiple-kernel regression (MKR) with gTV regularization. The theorem states that the solutions of MKR have kernel expansions with adaptive positions, while the gTV norm enforces an $\ell_1$ penalty on the coefficients. We discuss the sparsity-promoting effect of the gTV norm which prevents redundancy in the multiple-kernel scenario.
Early stopping of iterative algorithms is a widely-used form of regularization in statistical learning, commonly used in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including $L^2$-boost, LogitBoost and AdaBoost, among others), we connect the performance of a stopped iterate to the localized Rademacher/Gaussian complexity of the associated function class. This connection allows us to show that local fixed point analysis, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes, and illustrate the correspondence of our theory with practice for Sobolev kernel classes.
The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm . In this paper, we present an algorithm thatcomputes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity ofsolving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, weshow empirically that the effect of the block 1-norm regularization differsnotably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case.