Gradient Descent Provably Optimizes Over-parameterized Neural Networks
Du, Simon S., Zhai, Xiyu, Poczos, Barnabas, Singh, Aarti
Neural networks trained by first order methods have achieved a remarkable impact on many applications, but their theoretical properties are still mysteries. One of the empirical observation is even though the optimization objective function is non-convex and non-smooth, randomly initialized first order methods like stochastic gradient descent can still find a global minimum. Surprisingly, this property is not correlated with labels. In Zhang et al. [2016], authors replaced the true labels with randomly generated labels, but still found randomly initialized first order methods can always achieve zero training loss. A widely believed explanation on why a neural network can fit all training labels is because the neural network is over-parameterized. For example, Wide ResNet [Zagoruyko and Komodakis] uses 100x parameters than the number of training data and thus there must exist one such neural network of this architecture that can fit all training data. However, the existence does not imply why the network found by a randomly initialized first order method can fit all the data.
Oct-4-2018