Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruosong Wang
The well-known work of Zhang et al. (2017) highlighted intriguing experimental phenomena about deep net training, specifically optimization and generalization, and asked whether theory could explain them. They showed that sufficiently powerful nets (with vastly more parameters than the number of training samples) can attain zero training error, regardless of whether the data is properly labeled or randomly labeled. Obviously, training with randomly labeled data cannot generalize, whereas training with properly labeled data generalizes. See Figure 2, which replicates some of these results. Recent papers have begun to provide explanations, showing that gradient descent can allow an overparametrized multi-layer net to attain arbitrarily low training error on fairly generic datasets (Du et al., 2018a,c; Li & Liang, 2018; Allen-Zhu et al., 2018b; Zou et al., 2018), provided the amount of overparametrization is a high polynomial of the relevant parameters (i.e., vastly more than the overparametrization in Zhang et al. (2017)).
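To make the phenomenon concrete, here is a minimal NumPy sketch (not a replication of the paper's Figure 2 experiments) on synthetic data: a two-layer ReLU net whose hidden width far exceeds the number of samples, with output weights frozen at random signs and only the first-layer weights trained by gradient descent on squared loss, in the spirit of the setting analyzed here. The data, width, learning rate, and step count are illustrative assumptions; run once with labels produced by a linear rule and once with random labels, both runs drive the training loss toward zero, even though only the former could hope to generalize.

# Minimal sketch of the Zhang et al. (2017) observation on synthetic data.
# All hyperparameters below are illustrative assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 30, 10, 10000            # n training samples, input dim d, hidden width m >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y_true = np.sign(X @ rng.standard_normal(d))    # "properly labeled": labels from a linear rule
y_rand = rng.choice([-1.0, 1.0], size=n)        # "randomly labeled": labels independent of X

def train(y, steps=2000, lr=0.3):
    """Gradient descent on 0.5 * sum_i (f(x_i) - y_i)^2 over the first-layer weights W."""
    W = rng.standard_normal((m, d))              # first-layer weights w_r ~ N(0, I), trained
    a = rng.choice([-1.0, 1.0], size=m)          # output weights a_r in {-1, +1}, frozen
    for _ in range(steps):
        H = X @ W.T                              # (n, m) pre-activations w_r . x_i
        f = np.maximum(H, 0.0) @ a / np.sqrt(m)  # network outputs f(x_i)
        err = f - y                              # residuals
        # dL/dw_r = (1/sqrt(m)) * a_r * sum_i err_i * 1[w_r . x_i > 0] * x_i
        grad = ((err[:, None] * (H > 0) * a[None, :]).T @ X) / np.sqrt(m)
        W -= lr * grad
    f = np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)
    return 0.5 * np.mean((f - y) ** 2)

print("final train loss, true labels:  ", train(y_true))
print("final train loss, random labels:", train(y_rand))

Both printed losses come out near zero: overparametrization lets gradient descent interpolate either label set, so zero training error by itself says nothing about generalization, which is the gap this paper's analysis addresses.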
Jan-24-2019