Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
Li, Yuanzhi, Ma, Tengyu, Zhang, Hongyang
Over-parameterized models are crucial in deep learning, but their workings are far from understood. Over-parameterization -- the technique of using more parameters than statistically necessary -- apparently improves the training: theoretical and empirical results have suggested that it can enhance the geometric properties of the optimization landscape in simplified settings [24, 18, 17, 34] and thus make it easier to train over-parameterized models. On the other hand, over-parameterization often doesn't hurt the test performance, even if the number of parameters is much larger than the number of examples. Large neural networks used in practice have enough expressiveness to fit any labels of the training datasets [42, 17]. The training objective function may have multiple global minima with almost zero training error, some of which generalize better than the others [21, 11]. However, local improvement algorithms such as stochastic gradient descent, starting with proper initialization, may prefer some generalizable local minima to the others and thus provide an implicit effect of regularization [38, 28, 19, 27, 41]. Such regularization seems to depend on the algorithmic choice, the initialization scheme, and certain intrinsic properties of the data. The phenomenon and intuition above can be theoretically fleshed out in the context of linear models [35], whereas less is known for nonlinear models whose training objectives are usually nonconvex. The very important work of Gunasekar et al. [16] initiates the study of low-rank matrix factorization models with over-parameterization and conjectures that gradient descent prefers small trace norm solution in over-parameterized models with thorough empirical evidences.
Mar-19-2018