Kernel methods are a class of algorithms for pattern analysis, whose best known member is the support vector machine (SVM). In its simplest form, the kernel trick means transforming data into another dimension that has a clear dividing margin between classes of data. (Wikipedia)
We consider the analysis of probability distributions through their associated covariance operators from reproducing kernel Hilbert spaces. We show that the von Neumann entropy and relative entropy of these operators are intimately related to the usual notions of Shannon entropy and relative entropy, and share many of their properties. They come together with efficient estimation algorithms from various oracles on the probability distributions. We also consider product spaces and show that for tensor product kernels, we can define notions of mutual information and joint entropies, which can then characterize independence perfectly, but only partially conditional independence. We finally show how these new notions of relative entropy lead to new upper-bounds on log partition functions, that can be used together with convex optimization within variational inference methods, providing a new family of probabilistic inference methods.
Recent empirical work has shown that hierarchical convolutional kernels inspired by convolutional neural networks (CNNs) significantly improve the performance of kernel methods in image classification tasks. A widely accepted explanation for the success of these architectures is that they encode hypothesis classes that are suitable for natural images. However, understanding the precise interplay between approximation and generalization in convolutional architectures remains a challenge. In this paper, we consider the stylized setting of covariates (image pixels) uniformly distributed on the hypercube, and fully characterize the RKHS of kernels composed of single layers of convolution, pooling, and downsampling operations. We then study the gain in sample efficiency of kernel methods using these kernels over standard inner-product kernels. In particular, we show that 1) the convolution layer breaks the curse of dimensionality by restricting the RKHS to `local' functions; 2) local pooling biases learning towards low-frequency functions, which are stable by small translations; 3) downsampling may modify the high-frequency eigenspaces but leaves the low-frequency part approximately unchanged. Notably, our results quantify how choosing an architecture adapted to the target function leads to a large improvement in the sample complexity.
Quantum kernel method is one of the key approaches to quantum machine learning, which has the advantages that it does not require optimization and has theoretical simplicity. By virtue of these properties, several experimental demonstrations and discussions of the potential advantages have been developed so far. However, as is the case in classical machine learning, not all quantum machine learning models could be regarded as kernel methods. In this work, we explore a quantum machine learning model with a deep parameterized quantum circuit and aim to go beyond the conventional quantum kernel method. In this case, the representation power and performance are expected to be enhanced, while the training process might be a bottleneck because of the barren plateaus issue. However, we find that parameters of a deep enough quantum circuit do not move much from its initial values during training, allowing first-order expansion with respect to the parameters. This behavior is similar to the neural tangent kernel in the classical literatures, and such a deep variational quantum machine learning can be described by another emergent kernel, quantum tangent kernel. Numerical simulations show that the proposed quantum tangent kernel outperforms the conventional quantum kernel method for an ansatz-generated dataset. This work provides a new direction beyond the conventional quantum kernel method and explores potential power of quantum machine learning with deep parameterized quantum circuits.
With noisy intermediate-scale quantum computers showing great promise for near-term applications, a number of machine learning algorithms based on parametrized quantum circuits have been suggested as possible means to achieve learning advantages. Yet, our understanding of how these quantum machine learning models compare, both to existing classical models and to each other, remains limited. A big step in this direction has been made by relating them to so-called kernel methods from classical machine learning. By building on this connection, previous works have shown that a systematic reformulation of many quantum machine learning models as kernel models was guaranteed to improve their training performance. In this work, we first extend the applicability of this result to a more general family of parametrized quantum circuit models called data re-uploading circuits. Secondly, we show, through simple constructions and numerical simulations, that models defined and trained variationally can exhibit a critically better generalization performance than their kernel formulations, which is the true figure of merit of machine learning tasks. Our results constitute another step towards a more comprehensive theory of quantum machine learning models next to kernel formulations.
Recently, with the digitalization of medicine, the utilization of real-world medical data collected from clinical sites has been attracting attention. In this study, quantum computing was applied to a linear non-Gaussian acyclic model to discover causal relationships from real-world medical data alone. Specifically, the independence measure of DirectLiNGAM, a causal discovery algorithm, was calculated using the quantum kernel and its accuracy on real-world medical data was verified. When DirectLiNGAM with the quantum kernel (qLiNGAM) was applied to real-world medical data, a case was confirmed in which the causal structure could be correctly estimated when the amount of data was small, which was not possible with existing methods. It is suggested that qLiNGAM may be able to discover new medical knowledge and contribute to the solution of medical problems, even when only a small amount of data is available.
Some response surface functions in complex engineering systems are usually highly nonlinear, unformed, and expensive-to-evaluate. To tackle this challenge, Bayesian optimization, which conducts sequential design via a posterior distribution over the objective function, is a critical method used to find the global optimum of black-box functions. Kernel functions play an important role in shaping the posterior distribution of the estimated function. The widely used kernel function, e.g., radial basis function (RBF), is very vulnerable and susceptible to outliers; the existence of outliers is causing its Gaussian process surrogate model to be sporadic. In this paper, we propose a robust kernel function, Asymmetric Elastic Net Radial Basis Function (AEN-RBF). Its validity as a kernel function and computational complexity are evaluated. When compared to the baseline RBF kernel, we prove theoretically that AEN-RBF can realize smaller mean squared prediction error under mild conditions. The proposed AEN-RBF kernel function can also realize faster convergence to the global optimum. We also show that the AEN-RBF kernel function is less sensitive to outliers, and hence improves the robustness of the corresponding Bayesian optimization with Gaussian processes. Through extensive evaluations carried out on synthetic and real-world optimization problems, we show that AEN-RBF outperforms existing benchmark kernel functions.
Spectral approximation and variational inducing learning for the Gaussian process are two popular methods to reduce computational complexity. However, in previous research, those methods always tend to adopt the orthonormal basis functions, such as eigenvectors in the Hilbert space, in the spectrum method, or decoupled orthogonal components in the variational framework. In this paper, inspired by quantum physics, we introduce a novel basis function, which is tunable, local and bounded, to approximate the kernel function in the Gaussian process. There are two adjustable parameters in these functions, which control their orthogonality to each other and limit their boundedness. And we conduct extensive experiments on open-source datasets to testify its performance. Compared to several state-of-the-art methods, it turns out that the proposed method can obtain satisfactory or even better results, especially with poorly chosen kernel functions.
Determinantal Point Process (DPPs) are statistical models for repulsive point patterns. Both sampling and inference are tractable for DPPs, a rare feature among models with negative dependence that explains their popularity in machine learning and spatial statistics. Parametric and nonparametric inference methods have been proposed in the finite case, i.e. when the point patterns live in a finite ground set. In the continuous case, only parametric methods have been investigated, while nonparametric maximum likelihood for DPPs -- an optimization problem over trace-class operators -- has remained an open question. In this paper, we show that a restricted version of this maximum likelihood (MLE) problem falls within the scope of a recent representer theorem for nonnegative functions in an RKHS. This leads to a finite-dimensional problem, with strong statistical ties to the original MLE. Moreover, we propose, analyze, and demonstrate a fixed point algorithm to solve this finite-dimensional problem. Finally, we also provide a controlled estimate of the correlation kernel of the DPP, thus providing more interpretability.
As the analytic tools become more powerful, and more data are generated on a daily basis, the issue of data privacy arises. This leads to the study of the design of privacy-preserving machine learning algorithms. Given two objectives, namely, utility maximization and privacy-loss minimization, this work is based on two previously non-intersecting regimes -- Compressive Privacy and multi-kernel method. Compressive Privacy is a privacy framework that employs utility-preserving lossy-encoding scheme to protect the privacy of the data, while multi-kernel method is a kernel based machine learning regime that explores the idea of using multiple kernels for building better predictors. The compressive multi-kernel method proposed consists of two stages -- the compression stage and the multi-kernel stage. The compression stage follows the Compressive Privacy paradigm to provide the desired privacy protection. Each kernel matrix is compressed with a lossy projection matrix derived from the Discriminant Component Analysis (DCA). The multi-kernel stage uses the signal-to-noise ratio (SNR) score of each kernel to non-uniformly combine multiple compressive kernels. The proposed method is evaluated on two mobile-sensing datasets -- MHEALTH and HAR -- where activity recognition is defined as utility and person identification is defined as privacy. The results show that the compression regime is successful in privacy preservation as the privacy classification accuracies are almost at the random-guess level in all experiments. On the other hand, the novel SNR-based multi-kernel shows utility classification accuracy improvement upon the state-of-the-art in both datasets. These results indicate a promising direction for research in privacy-preserving machine learning.
Many supervised learning problems involve high-dimensional data such as images, text, or graphs. In order to make efficient use of data, it is often useful to leverage certain geometric priors in the problem at hand, such as invariance to translations, permutation subgroups, or stability to small deformations. We study the sample complexity of learning problems where the target function presents such invariance and stability properties, by considering spherical harmonic decompositions of such functions on the sphere. We provide non-parametric rates of convergence for kernel methods, and show improvements in sample complexity by a factor equal to the size of the group when using an invariant kernel over the group, compared to the corresponding non-invariant kernel. These improvements are valid when the sample size is large enough, with an asymptotic behavior that depends on spectral properties of the group. Finally, these gains are extended beyond invariance groups to also cover geometric stability to small deformations, modeled here as subsets (not necessarily subgroups) of permutations.