Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods
Zhou, Junyu; Wang, Puyu; Lei, Yunwen; Kloft, Marius; Ying, Yiming
A central mystery in deep learning is how neural networks, despite being highly non-convex and heavily overparameterized, achieve near-zero training error while still generalizing well to unseen data. This paradox has sparked a surge of research aimed at understanding the convergence and generalization behavior of neural networks [1, 2, 6, 7, 15, 38, 41, 49]. The Neural Tangent Kernel (NTK), introduced by [20], has become a foundational tool for understanding the training dynamics of neural networks, especially those trained using gradient-based methods such as gradient descent (GD) and stochastic gradient descent (SGD). The core idea is to linearize the neural network around its random initialization, so that the evolution of the network during training is closely approximated by a kernel method associated with the corresponding NTK. This framework connects the training dynamics of a neural network to the behavior of kernel methods in the reproducing kernel Hilbert space (RKHS) induced by the NTK, allowing insights from kernel methods to inform our understanding of neural networks. Following this perspective, the influential work [34] showed that for regression problems, shallow neural networks trained by SGD can achieve generalization performance on par with their kernel counterparts.
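To make the linearization concrete, the following is a sketch in assumed notation that does not appear in the text above: $f(x;\theta)$ denotes the network output with parameters $\theta$, $\theta_0$ the random initialization, and $K$ the finite-width (empirical) tangent kernel evaluated at $\theta_0$. It illustrates the standard first-order expansion rather than the specific setting analyzed in this paper.

```latex
% Assumed notation (not from the source): f(x;\theta) is the network output,
% \theta_0 is the random initialization, K is the empirical tangent kernel at \theta_0.
\begin{align*}
f(x;\theta) &\approx f(x;\theta_0) + \big\langle \nabla_\theta f(x;\theta_0),\, \theta - \theta_0 \big\rangle, \\
K(x,x') &= \big\langle \nabla_\theta f(x;\theta_0),\, \nabla_\theta f(x';\theta_0) \big\rangle.
\end{align*}
```

Under this approximation the model is linear in $\theta$ with feature map $x \mapsto \nabla_\theta f(x;\theta_0)$, which is why gradient-based training of the linearized network behaves like a kernel method with kernel $K$.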
Jun-8-2026