

c82836ed448c41094025b4a872c5341e-Supplemental.pdf

Neural Information Processing Systems

Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization, and in particular the critical role of small random initialization, are not fully understood.


c82836ed448c41094025b4a872c5341e-Paper.pdf

Neural Information Processing Systems

Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization, and in particular the critical role of small random initialization, are not fully understood.




16bda725ae44af3bb9316f416bd13b1b-Paper.pdf

Neural Information Processing Systems

However, since their proof relies on the existence of a convergent subsequence, it does not reveal any rate for global convergence.



Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks

arXiv.org Machine Learning

Kolmogorov-Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\epsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\epsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.