We consider a general regularised interpolation problem for learning a parameter vector from data. The well known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is the core of kernel methods in machine learning as it makes the problem computationally tractable. Most literature deals only with sufficient conditions for representer theorems in Hilbert spaces. We prove necessary and sufficient conditions for the existence of representer theorems in reflexive Banach spaces and illustrate why in a sense reflexivity is the minimal requirement on the function space. We further show that if the learning relies on the linear representer theorem the solution is independent of the regulariser and in fact determined by the function space alone. This in particular shows the value of generalising Hilbert space learning theory to Banach spaces.

Shen, Cencheng, Vogelstein, Joshua T.

Distance-based methods, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel methods, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. Previous works demonstrated the equivalence of distance and kernel methods only at the population level, for each kind of test, requiring an embedding theory of kernels. We propose a simple, bijective transformation between semimetrics and nondegenerate kernels. We prove that for finite samples, two-sample tests are special cases of independence tests, and the distance-based statistic is equivalent to the kernel-based statistic, including the biased, unbiased, and normalized versions. In other words, upon setting the kernel or metric to be bijective of each other, running any of the four algorithms will yield the exact same answer up to numerical precision. This deepens and unifies our understanding of interpoint comparison based methods.

Pauwels, Edouard, Bach, Francis, Vert, Jean-Philippe

Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows to elucidate their relationships with the chosen kernel as well as population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.

We consider a general regularised interpolation problem for learning a parameter vector from data. The well known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is the core of kernel methods in machine learning as it makes the problem computationally tractable. Necessary and sufficient conditions for differentiable regularisers on Hilbert spaces to admit a representer theorem have been proved. We extend those results to nondifferentiable regularisers on uniformly convex and uniformly smooth Banach spaces. This gives a (more) complete answer to the question when there is a representer theorem. We then note that for regularised interpolation in fact the solution is determined by the function space alone and independent of the regulariser, making the extension to Banach spaces even more valuable.

We address the problem of {\it adaptivity} in the framework of reproducing kernel Hilbert space (RKHS) regression. More precisely, we analyze estimators arising from a linear regularization scheme $g_\lam$. In practical applications, an important task is to choose the regularization parameter $\lam$ appropriately, i.e. based only on the given data and independently on unknown structural assumptions on the regression function. An attractive approach avoiding data-splitting is the {\it Lepskii Principle} (LP), also known as the {\it Balancing Principle} is this setting. We show that a modified parameter choice based on (LP) is minimax optimal adaptive, up to $\log\log(n)$. A convenient result is the fact that balancing in $L^2(\nu)-$ norm, which is easiest, automatically gives optimal balancing in all stronger norms, interpolating between $L^2(\nu)$ and the RKHS. An analogous result is open for other classical approaches to data dependent choices of the regularization parameter, e.g. for Hold-Out.

Ding, Lizhong (King Abdullah University of Science and Technology (KAUST)) | Liao, Shizhong (Tianjin University) | Liu, Yong (CAS, Beijing, Institute of Information Engineering) | Yang, Peng (King Abdullah University of Science and Technology (KAUST)) | Gao, Xin (King Abdullah University of Science and Technology (KAUST))

Kernel selection aims at choosing an appropriate kernel function for kernel-based learning algorithms to avoid either underfitting or overfitting of the resulting hypothesis. One of the main problems faced by kernel selection is the evaluation of the goodness of a kernel, which is typically difficult and computationally expensive. In this paper, we propose a randomized kernel selection approach to evaluate and select the kernel with the spectra of the specifically designed multilevel circulant matrices (MCMs), which is statistically sound and computationally efficient. Instead of constructing the kernel matrix, we construct the randomized MCM to encode the kernel function and all data points together with labels. We build a one-to-one correspondence between all candidate kernel functions and the spectra of the randomized MCMs by Fourier transform. We prove the statistical properties of the randomized MCMs and the randomized kernel selection criteria, which theoretically qualify the utility of the randomized criteria in kernel selection. With the spectra of the randomized MCMs, we derive a series of randomized criteria to conduct kernel selection, which can be computed in log-linear time and linear space complexity by fast Fourier transform (FFT). Experimental results demonstrate that our randomized kernel selection criteria are significantly more efficient than the existing classic and widely-used criteria while preserving similar predictive performance.

Mukuta, Yusuke (The University of Tokyo) | Ushiku, Yoshitaka (The University of Tokyo) | Harada, Tatsuya (The University of Tokyo and RIKEN AIP )

The random features method is an efficient method to approximate the kernel function. In this paper, we propose novel random features called "alternating circulant random features,'' which consist of a random mixture of independent random structured matrices. Existing fast random features exploit random sign flipping to reduce the correlation between features. Sign flipping works well on random Fourier features for real-valued shift-invariant kernels because the corresponding weight distribution is symmetric. However, this method cannot be applied to random Laplace features directly because the distribution is not symmetric. The method proposed herein yields alternating circulant random features, with the correlation between features being reduced through the random sampling of weights from multiple independent random structured matrices instead of via random sign flipping. The proposed method facilitates rapid calculation by employing structured matrices. In addition, the weight distribution is preserved because sign flipping is not implemented. The performance of the proposed alternating circulant random features method is theoretically and empirically evaluated.

Damodaran, Bharath Bhushan, Courty, Nicolas, Gosselin, Philippe-Henri

Kernel methods are powerful and flexible approach to solve many problems in machine learning. Due to the pairwise evaluations in kernel methods, the complexity of kernel computation grows as the data size increases; thus the applicability of kernel methods is limited for large scale datasets. Random Fourier Features (RFF) has been proposed to scale the kernel method for solving large scale datasets by approximating kernel function using randomized Fourier features. While this method proved very popular, still it exists shortcomings to be effectively used. As RFF samples the randomized features from a distribution independent of training data, it requires sufficient large number of feature expansions to have similar performances to kernelized classifiers, and this is proportional to the number samples in the dataset. Thus, reducing the number of feature dimensions is necessary to effectively scale to large datasets. In this paper, we propose a kernel approximation method in a data dependent way, coined as Pseudo Random Fourier Features (PRFF) for reducing the number of feature dimensions and also to improve the prediction performance. The proposed approach is evaluated on classification and regression problems and compared with the RFF, orthogonal random features and Nystr{\"o}m approach

Boloix-Tortosa, Rafael, Murillo-Fuentes, Juan José, Velázquez, Irene Santos, Pérez-Cruz, Fernando

Usually, complex-valued RKHS are presented as an straightforward application of the real-valued case. In this paper we prove that this procedure yields a limited solution for regression. We show that another kernel, here denoted as pseudo kernel, is needed to learn any function in complex-valued fields. Accordingly, we derive a novel RKHS to include it, the widely RKHS (WRKHS). When the pseudo-kernel cancels, WRKHS reduces to complex-valued RKHS of previous approaches. We address the kernel and pseudo-kernel design, paying attention to the kernel and the pseudo-kernel being complex-valued. In the experiments included we report remarkable improvements in simple scenarios where real a imaginary parts have different similitude relations for given inputs or cases where real and imaginary parts are correlated. In the context of these novel results we revisit the problem of non-linear channel equalization, to show that the WRKHS helps to design more efficient solutions.

We propose a scheme for recycling Gaussian random vectors into structured matrices to approximate various kernel functions in sublinear time via random embeddings. Our framework includes the Fastfood construction as a special case, but also extends to Circulant, Toeplitz and Hankel matrices, and the broader family of structured matrices that are characterized by the concept of low-displacement rank. We introduce notions of coherence and graph-theoretic structural constants that control the approximation quality, and prove unbiasedness and low-variance properties of random feature maps that arise within our framework. For the case of low-displacement matrices, we show how the degree of structure and randomness can be controlled to reduce statistical variance at the cost of increased computation and storage requirements. Empirical results strongly support our theory and justify the use of a broader family of structured matrices for scaling up kernel methods using random features.