Kernel methods are a class of algorithms for pattern analysis, whose best known member is the support vector machine (SVM). In its simplest form, the kernel trick means transforming data into another dimension that has a clear dividing margin between classes of data. (Wikipedia)
Kernel methods, a new generation of learning algorithms, utilize techniques from optimization, statistics, and functional analysis to achieve maximal generality, flexibility, and performance. These algorithms are different from earlier techniques used in machine learning in many respects: For example, they are explicitly based on a theoretical model of learning rather than on loose analogies with natural learning systems or other heuristics. They come with theoretical guarantees about their performance and have a modular design that makes it possible to separately implement and analyze their components. They are not affected by the problem of local minima because their training amounts to convex optimization. In the last decade, a sizable community of theoreticians and practitioners has formed around these methods, and a number of practical applications have been realized.
Kernel methods are powerful machine learning techniques which implement generic non-linear functions to solve complex tasks in a simple way. They Have a solid mathematical background and exhibit excellent performance in practice. However, kernel machines are still considered black-box models as the feature mapping is not directly accessible and difficult to interpret.The aim of this work is to show that it is indeed possible to interpret the functions learned by various kernel methods is intuitive despite their complexity. Specifically, we show that derivatives of these functions have a simple mathematical formulation, are easy to compute, and can be applied to many different problems. We note that model function derivatives in kernel machines is proportional to the kernel function derivative. We provide the explicit analytic form of the first and second derivatives of the most common kernel functions with regard to the inputs as well as generic formulas to compute higher order derivatives. We use them to analyze the most used supervised and unsupervised kernel learning methods: Gaussian Processes for regression, Support Vector Machines for classification, Kernel Entropy Component Analysis for density estimation, and the Hilbert-Schmidt Independence Criterion for estimating the dependency between random variables. For all cases we expressed the derivative of the learned function as a linear combination of the kernel function derivative. Moreover we provide intuitive explanations through illustrative toy examples and show how to improve the interpretation of real applications in the context of spatiotemporal Earth system data cubes. This work reflects on the observation that function derivatives may play a crucial role in kernel methods analysis and understanding.
Multiple Kernel Learning is a recent and powerful paradigm to learn the kernel function from data. In this paper, we introduce MKLpy, a python-based framework for Multiple Kernel Learning. The library provides Multiple Kernel Learning algorithms for classification tasks, mechanisms to compute kernel functions for different data types, and evaluation strategies. The library is meant to maximize the usability and to simplify the development of novel solutions.
The Tanimoto kernel (Jaccard index) is a well known tool to describe the similarity between sets of binary attributes. It has been extended to the case when the attributes are nonnegative real values. This paper introduces a more general Tanimoto kernel formulation which allows to measure the similarity of arbitrary real-valued functions. This extension is constructed by unifying the representation of the attributes via properly chosen sets. After deriving the general form of the kernel, explicit feature representation is extracted from the kernel function, and a simply way of including general kernels into the Tanimoto kernel is shown. Finally, the kernel is also expressed as a quotient of piecewise linear functions, and a smooth approximation is provided.
By using the framework of Determinantal Point Processes (DPPs), some theoretical results concerning the interplay between diversity and regularization can be obtained. In this paper we show that sampling subsets with kDPPs results in implicit regularization in the context of ridgeless Kernel Regression. Furthermore, we leverage the common setup of state-of-the-art DPP algorithms to sample multiple small subsets and use them in an ensemble of ridgeless regressions. Our first empirical results indicate that ensemble of ridgeless regressors can be interesting to use for datasets including redundant information.
Kernel methods have proven to be powerful techniques for pattern analysis and machine learning (ML) in a variety of domains. However, many of their original or advanced implementations remain in Matlab. With the incredible rise and adoption of Python in the ML and data science world, there is a clear need for a well-defined library that enables not only the use of popular kernels, but also allows easy definition of customized kernels to fine-tune them for diverse applications. The kernelmethods library fills that important void in the python ML ecosystem in a domain-agnostic fashion, allowing the sample data type to be anything from numerical, categorical, graphs or a combination of them. In addition, this library provides a number of well-defined classes to make various kernel-based operations efficient (for large scale datasets), modular (for ease of domain adaptation), and inter-operable (across different ecosystems). The library is available at https://github.com/raamana/kernelmethods.
In the variational relevance vector machine, the gamma distribution is representative as a hyperprior over the noise precision of automatic relevance determination prior. Instead of the gamma hyperprior, we propose to use the inverse gamma hyperprior with a shape parameter close to zero and a scale parameter not necessary close to zero. This hyperprior is associated with the concept of a weakly informative prior. The effect of this hyperprior is investigated through regression to non-homogeneous data. Because it is difficult to capture the structure of such data with a single kernel function, we apply the multiple kernel method, in which multiple kernel functions with different widths are arranged for input data. We confirm that the degrees of freedom in a model is controlled by adjusting the scale parameter and keeping the shape parameter close to zero. A candidate for selecting the scale parameter is the predictive information criterion. However the estimated model using this criterion seems to cause over-fitting. This is because the multiple kernel method makes the model a situation where the dimension of the model is larger than the data size. To select an appropriate scale parameter even in such a situation, we also propose an extended prediction information criterion. It is confirmed that a multiple kernel relevance vector regression model with good predictive accuracy can be obtained by selecting the scale parameter minimizing extended prediction information criterion.
Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so to promote sparsity in the solution of the underlying regression problem. Indeed they solve an l p -norm regularization problem, with p m/(m 1) and m even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors generally requires a considerable amount of memory, ultimately limiting its applicability. In this work we address this problem by proposing two advances. First, we directly reduce the memory requirement, by introducing a new and more efficient layout for storing the data. Second, we use Nyström-type subsampling approach, which allows for a training phase with a smaller number of data points, so to reduce the computational cost. Experiments, both on synthetic and real datasets show the effectiveness of the proposed improvements. Finally, we take care of implementing the code in C so to further speedup the computation.
Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification. However, the use of kernel methods for node classification, which is a related problem to graph representation learning, is still ill-posed and the state-of-the-art methods are heavily based on heuristics. Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. Our approach is motivated by graph kernel methodology but extended to learn the node representations capturing the structural information in a graph. We theoretically show that our formulation is as powerful as any positive semidefinite kernels.
Super resolution (SR) methods typically assume that the low-resolution (LR) image was downscaled from the unknown high-resolution (HR) image by a fixed ideal' downscaling kernel (e.g. However, this is rarely the case in real LR images, in contrast to synthetically generated SR datasets. When the assumed downscaling kernel deviates from the true one, the performance of SR methods significantly deteriorates. This gave rise to Blind-SR - namely, SR when the downscaling kernel ( SR-kernel'') is unknown. It was further shown that the true SR-kernel is the one that maximizes the recurrence of patches across scales of the LR image.