# Kernel Methods

### Kernel methods and classes -- kernelmethods 0.2 documentation

This library fills an important void in the ever-growing Python-based machine learning ecosystem, where users can only use predefined kernels and cannot customize or extend them for their own applications, which demand great flexibility owing to their diversity and their need for better-performing kernels. This library defines the KernelMatrix class, which is central to all the kernel methods and machines. As the KernelMatrix class is a key bridge between input data and the various kernel learning algorithms, it is designed to be highly usable and extensible to different applications and data types. Besides applying basic kernels to a given sample (to produce a KernelMatrix), this library provides various kernel operations, such as normalization, centering, product, alignment evaluation, linear combination and ranking (by various performance metrics) of kernel matrices. In addition, we provide several convenience classes, such as KernelSet and KernelBucket, for easy management of a large collection of kernels.
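
To make the operations named above concrete, here is a minimal NumPy sketch of centering, normalization and alignment of kernel matrices. It deliberately does not use the kernelmethods API; every function name below is hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X, bandwidth=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def center(K):
    """Center K in feature space: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def normalize(K):
    """Cosine-normalize K so that every diagonal entry equals 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def alignment(K1, K2):
    """Kernel alignment: Frobenius inner product divided by the product of norms."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # toy sample: 20 points, 3 features

K_lin = X @ X.T                        # linear kernel matrix
K_rbf = gaussian_kernel_matrix(X, bandwidth=2.0)

K_lin_centered = center(K_lin)         # rows and columns now sum to zero
K_lin_normalized = normalize(K_lin)    # unit diagonal
align_self = alignment(K_rbf, K_rbf)   # a matrix is perfectly aligned with itself
align_cross = alignment(K_lin, K_rbf)
```

Alignment computed this way is a cosine similarity between matrices, which is one common basis for ranking candidate kernels against a target kernel.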

### Demo: Kernel methods for machine learning applications · Issue #1 · ohbm/OpenScienceRoom2019

This library fills an important void in the ever-growing Python-based machine learning ecosystem, where users are limited to a few predefined kernels without the ability to customize or extend them for their own applications. This library defines the KernelMatrix class, which is central to all the kernel methods. As it is a key bridge between input data and kernel learning algorithms, it is designed to be highly usable and extensible to different applications and data types. The kernel operations implemented are normalization, centering, product, alignment, linear combination and ranking. Convenience classes, such as Kernel{Set,Bucket}, are designed for easy management of a large collection of kernels. Dealing with diverse kernels and their fusion is necessary for automatic kernel selection in applications such as Multiple Kernel Learning. Besides numerical kernels, we designed this library to provide categorical, string and graph kernels, with the same attractive properties of an intuitive and highly testable API. Beyond these non-numerical kernels, we aim to provide a deeply extensible framework for arbitrary input data types, such as sequences and trees, via pyradigm. Moreover, drop-in Estimator classes are provided for seamless usage in the scikit-learn ecosystem.
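
Kernel fusion of the kind used in Multiple Kernel Learning relies on the fact that non-negative linear combinations and elementwise products of valid kernel matrices remain positive semidefinite. The following NumPy sketch checks this numerically on toy data; it is not the library's API, and the weights and the helper name are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))            # toy sample: 15 points, 4 features

K_lin = X @ X.T                          # linear kernel matrix
K_poly = (1.0 + X @ X.T) ** 2            # degree-2 polynomial kernel matrix

# Non-negative weights keep a linear combination positive semidefinite.
weights = np.array([0.3, 0.7])           # arbitrary illustrative weights
K_sum = weights[0] * K_lin + weights[1] * K_poly

# The elementwise (Schur) product of two PSD matrices is also PSD.
K_prod = K_lin * K_poly

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

min_eig_sum = min_eigenvalue(K_sum)      # non-negative up to round-off
min_eig_prod = min_eigenvalue(K_prod)    # non-negative up to round-off
```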

### Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta}$ where $n$ is the number of training examples and $\beta$ an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$. Remarkably, $\beta$ is the same for regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we introduce the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption, namely that the data are sampled from a regular lattice, we derive analytically $\beta$ for translation invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the smoothness and dimension of the training data. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, our results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data should be defined in terms of how the distance between nearest data points depends on $n$. With this definition one obtains reasonable effective smoothness estimates for MNIST and CIFAR10.
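
If the error follows $C\,n^{-\beta}$, then $\log(\text{error}) = \log C - \beta \log n$, so $\beta$ can be read off as the negative slope of a straight-line fit in log-log coordinates. The sketch below illustrates this on synthetic, noise-free errors generated with $\beta = 0.4$; the numbers are made up for illustration and are not the paper's measurements, and the function name is hypothetical.

```python
import numpy as np

def fit_learning_curve_exponent(n_train, test_error):
    """Least-squares fit of log(error) = log(C) - beta * log(n); returns (beta, C)."""
    slope, intercept = np.polyfit(np.log(n_train), np.log(test_error), deg=1)
    return -slope, np.exp(intercept)

# Synthetic learning curve, assumed for illustration only: error = 2 * n^(-0.4).
n_train = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
test_error = 2.0 * n_train ** -0.4

beta_hat, prefactor_hat = fit_learning_curve_exponent(n_train, test_error)
```

With real measurements the fit should be restricted to the asymptotic regime of large $n$, since small training sets typically deviate from the power law.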

### Inductive Regularized Learning of Kernel Functions

In this paper we consider the fundamental problem of semi-supervised kernel function learning. We propose a general regularized framework for learning a kernel matrix, and then demonstrate an equivalence between our proposed kernel matrix learning framework and a general linear transformation learning problem. Our result shows that the learned kernel matrices parameterize a linear transformation kernel function and can be applied inductively to new data points. Furthermore, our result gives a constructive method for kernelizing most existing Mahalanobis metric learning formulations. To make our results practical for large-scale data, we modify our framework to limit the number of parameters in the optimization process.
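
The link between a linear transformation, a kernel function and a Mahalanobis distance can be sketched as follows. This assumes a kernel of the form $k_W(x, y) = x^\top W y$ with $W = G^\top G$ positive semidefinite; the matrix `G` below is random and merely stands in for a learned transformation, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 3

# Hypothetical "learned" transformation G; W = G^T G is positive semidefinite.
G = rng.normal(size=(dim, dim))
W = G.T @ G

def linear_transformation_kernel(x, y, W):
    """k_W(x, y) = x^T W y, an inner product after mapping x -> G x."""
    return x @ W @ y

def mahalanobis_sq(x, y, W):
    """Squared Mahalanobis distance (x - y)^T W (x - y)."""
    diff = x - y
    return diff @ W @ diff

# Two points that played no role in obtaining W: the kernel applies to them directly.
x_new = rng.normal(size=dim)
y_new = rng.normal(size=dim)

k_xy = linear_transformation_kernel(x_new, y_new, W)

# The distance is recovered from kernel evaluations alone.
dist_from_kernel = (linear_transformation_kernel(x_new, x_new, W)
                    + linear_transformation_kernel(y_new, y_new, W)
                    - 2.0 * k_xy)
dist_direct = mahalanobis_sq(x_new, y_new, W)
```

Because the kernel is a function of $W$ rather than of a fixed set of training indices, it can be evaluated on unseen points, which is the sense in which such a kernel applies inductively.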

### Compressed Diffusion

Diffusion maps are a commonly used kernel-based method for manifold learning, which can reveal intrinsic structures in data and embed them in low dimensions. However, as with most kernel methods, their implementation requires a heavy computational load, reaching up to cubic complexity in the number of data points. This limits their usability in modern data analysis. Here, we present a new approach to computing the diffusion geometry, and related embeddings, from a compressed diffusion process between data regions rather than data points. Our construction is based on an adaptation of the previously proposed measure-based Gaussian correlation (MGC) kernel that robustly captures the local geometry around data points. We use this MGC kernel to efficiently compress diffusion relations from pointwise to data region resolution. Finally, a spectral embedding of the data regions provides coordinates that are used to interpolate and approximate the pointwise diffusion map embedding of data. We analyze theoretical connections between our construction and the original diffusion geometry of diffusion maps, and demonstrate the utility of our method in analyzing big datasets, where it outperforms competing approaches.
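
For orientation, the standard pointwise diffusion map (the baseline whose cost motivates the compressed variant, not the compressed method itself) can be sketched in NumPy as below. The bandwidth, the toy circle data and the function name are assumptions made for illustration.

```python
import numpy as np

def diffusion_map(X, bandwidth=1.0, n_components=2, t=1):
    """Pointwise diffusion map; the dense eigendecomposition costs O(n^3)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))       # Gaussian affinities
    degrees = K.sum(axis=1)
    # Symmetric conjugate of the Markov matrix P = D^{-1} K; same eigenvalues as P.
    S = K / np.sqrt(np.outer(degrees, degrees))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                     # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    right_vecs = eigvecs / np.sqrt(degrees)[:, None]      # right eigenvectors of P
    # Skip the trivial pair (eigenvalue 1, constant eigenvector).
    coords = right_vecs[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t
    return eigvals, coords

rng = np.random.default_rng(3)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
X = np.column_stack([np.cos(angles), np.sin(angles)])     # points on a unit circle

eigvals, coords = diffusion_map(X, bandwidth=0.3)
```

The `eigh` call on the full $n \times n$ matrix is the step that scales cubically with the number of points.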

### Relating Leverage Scores and Density using Regularized Christoffel Functions

Statistical leverage scores emerged as a fundamental tool for matrix sketching and column sampling with applications to low-rank approximation, regression, random feature learning and quadrature. Yet, the very nature of this quantity is barely understood. Borrowing ideas from the orthogonal polynomial literature, we introduce the regularized Christoffel function associated to a positive definite kernel. This uncovers a variational formulation for leverage scores for kernel methods and allows us to elucidate their relationships with the chosen kernel as well as population density. Our main result quantitatively describes a decreasing relation between leverage score and population density for a broad class of kernels on Euclidean spaces. Numerical simulations support our findings.
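
As a small numerical illustration of the decreasing relation between leverage and density, the sketch below computes ridge leverage scores in one commonly used form, the diagonal of $K (K + n\lambda I)^{-1}$, for a dense cluster plus one isolated point. The data, the Gaussian kernel bandwidth and the value of $\lambda$ are assumptions chosen for illustration, and the paper's exact normalization may differ.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Diagonal of K (K + n*lam*I)^{-1} for an n x n kernel matrix K."""
    n = K.shape[0]
    # Solve (K + n*lam*I) Z = K. K commutes with the regularised matrix,
    # so diag(Z) equals diag(K (K + n*lam*I)^{-1}).
    Z = np.linalg.solve(K + n * lam * np.eye(n), K)
    return np.diag(Z)

rng = np.random.default_rng(4)
cluster = 0.1 * rng.normal(size=(50, 2))      # high-density region near the origin
outlier = np.array([[5.0, 5.0]])              # single point in a low-density region
X = np.vstack([cluster, outlier])

# Gaussian kernel matrix with unit bandwidth.
sq_norms = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-sq_dists / 2.0)

scores = ridge_leverage_scores(K, lam=0.01)
outlier_score = scores[-1]                    # leverage of the isolated point
mean_cluster_score = scores[:-1].mean()       # average leverage inside the cluster
```

In this toy setting the isolated point carries far more leverage than a typical cluster point, which is the qualitative behaviour the abstract describes.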

### When is there a Representer Theorem? Reflexive Banach spaces

We consider a general regularised interpolation problem for learning a parameter vector from data. The well-known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is the core of kernel methods in machine learning as it makes the problem computationally tractable. Most literature deals only with sufficient conditions for representer theorems in Hilbert spaces. We prove necessary and sufficient conditions for the existence of representer theorems in reflexive Banach spaces and illustrate why, in a sense, reflexivity is the minimal requirement on the function space. We further show that if the learning relies on the linear representer theorem, the solution is independent of the regulariser and is in fact determined by the function space alone. This in particular shows the value of generalising Hilbert space learning theory to Banach spaces.
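
The computational payoff of a representer theorem is easiest to see in the familiar Hilbert-space case, not the Banach-space setting of the paper. In kernel ridge regression, the solution is $f(x) = \sum_i \alpha_i k(x_i, x)$, so fitting reduces to solving an $n \times n$ linear system. The sketch below assumes a Gaussian kernel and toy data; function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel between every row of A and every row of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def fit_kernel_ridge(X_train, y_train, lam, bandwidth=1.0):
    """Coefficients alpha solving (K + lam*I) alpha = y."""
    K = gaussian_kernel(X_train, X_train, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

def predict(X_train, alpha, X_query, bandwidth=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x): a function in the span of the data."""
    return gaussian_kernel(X_query, X_train, bandwidth) @ alpha

X_train = np.linspace(0.0, 2.0 * np.pi, 12)[:, None]   # 12 one-dimensional inputs
y_train = np.sin(X_train[:, 0])
lam = 1e-6                                              # small ridge parameter

alpha = fit_kernel_ridge(X_train, y_train, lam)
X_query = np.array([[1.0], [2.5]])
y_pred = predict(X_train, alpha, X_query)
y_fit = predict(X_train, alpha, X_train)                # fitted values on the training set
```

Only the $n$ coefficients in `alpha` are learned, regardless of the dimension of the underlying function space.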

### The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing

Distance-based methods, also called "energy statistics", are leading methods for two-sample and independence tests from the statistics community. Kernel methods, developed from "kernel mean embeddings", are leading methods for two-sample and independence tests from the machine learning community. Previous works demonstrated the equivalence of distance and kernel methods only at the population level, for each kind of test, requiring an embedding theory of kernels. We propose a simple, bijective transformation between semimetrics and nondegenerate kernels. We prove that for finite samples, two-sample tests are special cases of independence tests, and the distance-based statistic is equivalent to the kernel-based statistic, including the biased, unbiased, and normalized versions. In other words, once the kernel and the metric are chosen as bijective transformations of each other, running any of the four algorithms will yield the exact same answer up to numerical precision. This deepens and unifies our understanding of methods based on interpoint comparison.
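
The finite-sample equivalence can be checked numerically for the biased independence statistic. The sketch below assumes one simple transformation of the kind described, namely subtracting each distance from the largest pairwise distance; this specific form, the toy data and all function names are assumptions made for illustration. Double centering removes the constant shift and the two sign flips cancel in the product, so both statistics coincide.

```python
import numpy as np

def pairwise_euclidean(X):
    """Matrix of Euclidean distances between the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    return np.sqrt(sq)

def centered(M):
    """Double centering H M H with H = I - (1/n) 1 1^T."""
    n = M.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ M @ H

def biased_statistic(A, B):
    """(1/n^2) * sum_ij (H A H)_ij (H B H)_ij for two n x n pairwise matrices."""
    n = A.shape[0]
    return np.sum(centered(A) * centered(B)) / n ** 2

def distance_to_kernel(D):
    """Assumed transformation: largest pairwise distance minus each distance."""
    return D.max() - D

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(30, 1))   # Y depends nonlinearly on X

Dx, Dy = pairwise_euclidean(X), pairwise_euclidean(Y)
stat_distance = biased_statistic(Dx, Dy)              # distance-based statistic
stat_kernel = biased_statistic(distance_to_kernel(Dx),
                               distance_to_kernel(Dy))  # kernel-based statistic
```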

### When is there a Representer Theorem? Nondifferentiable Regularisers and Banach spaces

We consider a general regularised interpolation problem for learning a parameter vector from data. The well-known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is the core of kernel methods in machine learning as it makes the problem computationally tractable. Necessary and sufficient conditions for differentiable regularisers on Hilbert spaces to admit a representer theorem have been proved. We extend those results to nondifferentiable regularisers on uniformly convex and uniformly smooth Banach spaces. This gives a (more) complete answer to the question of when there is a representer theorem. We then note that for regularised interpolation the solution is in fact determined by the function space alone and is independent of the regulariser, making the extension to Banach spaces even more valuable.
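
The claim that the interpolating solution does not depend on the regulariser can be illustrated in the simplest possible setting, Euclidean space with regularisers that are increasing functions of the norm. This is a finite-dimensional Hilbert-space sketch under assumed toy data, not the Banach-space result of the paper. The interpolant in the span of the data has the smallest norm, so it minimizes every such regulariser at once.

```python
import numpy as np

rng = np.random.default_rng(6)
n_points, dim = 4, 10
X = rng.normal(size=(n_points, dim))      # data points x_i as rows
y = rng.normal(size=n_points)             # interpolation constraints <w, x_i> = y_i

# Candidate in the span of the data: w = X^T c with Gram system (X X^T) c = y.
c = np.linalg.solve(X @ X.T, y)
w_span = X.T @ c

# Any other interpolant differs by a vector orthogonal to every data point.
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]                   # unit vector with X @ null_direction ~ 0
w_other = w_span + 0.5 * null_direction

# Three different regularisers, all increasing functions of the Euclidean norm.
regularisers = {
    "squared norm": lambda w: np.linalg.norm(w) ** 2,
    "norm": lambda w: np.linalg.norm(w),
    "exp(norm)": lambda w: np.exp(np.linalg.norm(w)),
}

# For each regulariser: (value at the span solution, value at the other interpolant).
comparison = {name: (omega(w_span), omega(w_other))
              for name, omega in regularisers.items()}
```

Both vectors satisfy the interpolation constraints, yet the span solution attains the lower value under each of the three regularisers.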