Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $\mu$P Parametrization

Zixiang Chen, Greg Yang, Qingyue Zhao, Quanquan Gu

arXiv.org Machine Learning 

Deep learning has achieved remarkable success in various machine learning tasks, from image classification (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012) to game playing (Silver et al., 2016). Yet this empirical success poses a significant theoretical challenge: how can we explain the effectiveness of neural networks given their non-convex optimization landscape and over-parameterized nature? Traditional optimization and learning theory frameworks struggle to provide satisfactory explanations. A breakthrough came with the study of infinite-width neural networks, whose behavior can be precisely characterized in the infinite-width limit. This theoretical framework has spawned several important approaches to understanding neural networks, with the Neural Tangent Kernel (NTK) emerging as a prominent example. Under the NTK parametrization (NTP) (Jacot et al., 2018), neural network training behaves like a linear model: the features learned during training in each layer remain essentially identical to those at initialization.
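As an illustrative sketch of this lazy-training picture (the notation below is ours, not taken from the original text), the network output around its initialization $\theta_0$ admits a first-order expansion
\[
f(x;\theta_t) \;\approx\; f(x;\theta_0) \;+\; \nabla_\theta f(x;\theta_0)^\top \big(\theta_t - \theta_0\big),
\qquad
K_{\mathrm{NTK}}(x,x') \;=\; \nabla_\theta f(x;\theta_0)^\top \nabla_\theta f(x';\theta_0),
\]
so that, in the infinite-width limit under NTP, gradient descent is equivalent to kernel regression with the fixed kernel $K_{\mathrm{NTK}}$ and the intermediate-layer representations do not meaningfully change during training.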