How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks

Zhou, Mo, Ge, Rong

arXiv.org Artificial Intelligence 

Feature learning has long been considered to be a major advantage of neural networks. However, how gradient-based training algorithms can learn useful features is not well-understood. In particular, the most widely applied analysis for overparametrized neural networks is the neural tangent kernel(NTK)(Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019b). In this setting, the neurons don't move far from their initialization and the features are determined by the network architecture and random initialization (Chizat et al., 2019). While there are empirical and theoretical evidence on the limitation of NTK regime (Chizat et al., 2019; Arora et al., 2019), extending the analysis beyond the NTK regime has been challenging. For 2-layer networks, an alternative framework for analyzing overparametrized neural networks called mean-field analysis was introduced. Earlier mean-field analysis (e.g., Chizat and Bach, 2018; Mei et al., 2018) require either infinite or exponentially many neurons. Later works (e.g., Li et al., 2020; Ge et al., 2021; Bietti et al., 2022; Mahankali et al., 2024) can analyze the training dynamics of mildly overparametrized networks with polynomially many neurons with stronger assumptions on the ground-truth function.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found