 conjugate kernel


Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks

Neural Information Processing Systems

We study the eigenvalue distributions of the Conjugate Kernel and Neural Tangent Kernel associated to multi-layer feedforward neural networks. In an asymptotic regime where network width is increasing linearly in sample size, under random initialization of the weights, and for input samples satisfying a notion of approximate pairwise orthogonality, we show that the eigenvalue distributions of the CK and NTK converge to deterministic limits. The limit for the CK is described by iterating the Marcenko-Pastur map across the hidden layers. The limit for the NTK is equivalent to that of a linear combination of the CK matrices across layers, and may be described by recursive fixed-point equations that extend this Marcenko-Pastur map. We demonstrate the agreement of these asymptotic predictions with the observed spectra for both synthetic and CIFAR-10 training data, and we perform a small simulation to investigate the evolutions of these spectra over training.
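For orientation, the classical Marchenko-Pastur law is the base object behind a Marcenko-Pastur map. The sketch below is not the paper's layer-wise recursion. It only evaluates the standard unit-variance Marchenko-Pastur density for an aspect-ratio parameter `gamma` in (0, 1] and checks numerically that it has total mass 1 and mean 1. The names `mp_density` and `mp_moment` are hypothetical helpers introduced here, and the restriction to `gamma <= 1` is an assumption made to avoid the point mass at zero.

```python
import math


def mp_support(gamma):
    """Return the edges of the Marchenko-Pastur bulk for unit variance."""
    root = math.sqrt(gamma)
    return (1.0 - root) ** 2, (1.0 + root) ** 2


def mp_density(x, gamma):
    """Classical Marchenko-Pastur density (unit variance, 0 < gamma <= 1)."""
    if not 0.0 < gamma <= 1.0:
        # Assumption of this sketch: no atom at zero, so gamma must be <= 1.
        raise ValueError("this sketch only covers 0 < gamma <= 1")
    lower, upper = mp_support(gamma)
    if x <= lower or x >= upper:
        return 0.0
    return math.sqrt((upper - x) * (x - lower)) / (2.0 * math.pi * gamma * x)


def mp_moment(gamma, power=0, steps=20000):
    """Midpoint-rule estimate of the integral of x**power against the density."""
    lower, upper = mp_support(gamma)
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h
        total += (x ** power) * mp_density(x, gamma)
    return total * h


if __name__ == "__main__":
    gamma = 0.5
    print("mass:", mp_moment(gamma, 0))
    print("mean:", mp_moment(gamma, 1))
    print("second moment:", mp_moment(gamma, 2))
```

For this law the second moment is `1 + gamma`, which gives a quick sanity check on the numerical integration.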


Concentration of measure for non-linear random matrices with applications to neural networks and non-commutative polynomials

Adamczak, Radosław

arXiv.org Artificial Intelligence

We prove concentration inequalities for several models of non-linear random matrices. As corollaries we obtain estimates for linear spectral statistics of the conjugate kernel of neural networks and non-commutative polynomials in (possibly dependent) random matrices.


Thompson Sampling in Function Spaces via Neural Operators

Oliveira, Rafael, Wang, Xuesong, Chai, Kian Ming A., Bonilla, Edwin V.

arXiv.org Machine Learning

We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that functional evaluations are inexpensive, while queries to the operator (such as running a high-fidelity simulator) are costly. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process. We provide novel theoretical convergence guarantees, based on Gaussian processes in the infinite-dimensional setting, under minimal assumptions. We benchmark our method against existing baselines on functional optimization tasks involving partial differential equations and other nonlinear operator-driven phenomena, demonstrating improved sample efficiency and competitive performance.



Review for NeurIPS paper: Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks

Neural Information Processing Systems

The reviewers and I are all confident that this paper will be interesting to the NeurIPS community and should be accepted. In addition to the improvements suggested by the reviewers, I would encourage the authors to expand the description of how to unfold the recursion in Theorem 3.7. The discussion in Appendix A helps, but it is insufficient as it is missing crucial details that would clarify how to interpret some of the ambiguous notation. I think including a detailed worked example would be an important addition.


Double-descent curves in neural networks: a new perspective using Gaussian processes

Harzli, Ouns El, Valle-Pérez, Guillermo, Louis, Ard A.

arXiv.org Machine Learning

Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterised regime. Here we use a neural network Gaussian process (NNGP) which maps exactly to a fully connected network (FCN) in the infinite width limit, combined with techniques from random matrix theory, to calculate this generalisation behaviour, with a particular focus on the overparameterised regime. We verify our predictions with numerical simulations of the corresponding Gaussian process regressions. An advantage of our NNGP approach is that the analytical calculations are easier to interpret. We argue that the generalisation performance of neural networks improves in the overparameterised regime precisely because that is where they converge to their equivalent Gaussian process.


Characteristic Kernels and Infinitely Divisible Distributions

Nishiyama, Yu, Fukumizu, Kenji

arXiv.org Machine Learning

We connect shift-invariant characteristic kernels to infinitely divisible distributions on $\mathbb{R}^{d}$. Characteristic kernels play an important role in machine learning applications because their kernel means can distinguish any two probability measures. The contribution of this paper is two-fold. First, we show, using the Lévy-Khintchine formula, that any shift-invariant kernel given by a bounded, continuous and symmetric probability density function (pdf) of an infinitely divisible distribution on $\mathbb{R}^d$ is characteristic. We also present some closure properties of such characteristic kernels under addition, pointwise product, and convolution. Second, in developing various kernel mean algorithms, it is fundamental to compute the following values: (i) kernel mean values $m_P(x)$, $x \in \mathcal{X}$, and (ii) kernel mean RKHS inner products $\left\langle m_P, m_Q \right\rangle_{\mathcal{H}}$, for probability measures $P, Q$. If $P$, $Q$, and the kernel $k$ are Gaussian, then computations (i) and (ii) result in Gaussian pdfs, which are tractable. We generalize this Gaussian combination to more general cases in the class of infinitely divisible distributions. We then introduce a conjugate kernel and a convolution trick, so that the above (i) and (ii) have the same pdf form, expecting tractable computation at least in some cases. As specific instances, we explore $\alpha$-stable distributions and a rich class of generalized hyperbolic distributions, which include the Laplace, Cauchy and Student-t distributions.
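
The Gaussian case mentioned in the abstract can be written out as a short worked example. The notation here is an assumption made for illustration and is not taken from the paper: the kernel is the Gaussian density $k(x, y) = \mathcal{N}(x - y;\, 0, \Sigma_k)$, with $P = \mathcal{N}(\mu_P, \Sigma_P)$ and $Q = \mathcal{N}(\mu_Q, \Sigma_Q)$. Since convolving Gaussian densities adds their covariances, both quantities are again Gaussian pdfs:

$$
m_P(x) = \int k(x, y)\, dP(y) = \mathcal{N}\left(x;\, \mu_P,\, \Sigma_P + \Sigma_k\right),
$$

$$
\left\langle m_P, m_Q \right\rangle_{\mathcal{H}} = \iint k(x, y)\, dP(x)\, dQ(y) = \mathcal{N}\left(\mu_P - \mu_Q;\, 0,\, \Sigma_P + \Sigma_Q + \Sigma_k\right).
$$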