Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Du, Simon S., Zhai, Xiyu, Poczos, Barnabas, Singh, Aarti

Oct-4-2018–arXiv.org Machine Learning

Neural networks trained by first order methods have achieved a remarkable impact on many applications, but their theoretical properties are still mysteries. One of the empirical observation is even though the optimization objective function is non-convex and non-smooth, randomly initialized first order methods like stochastic gradient descent can still find a global minimum. Surprisingly, this property is not correlated with labels. In Zhang et al. [2016], authors replaced the true labels with randomly generated labels, but still found randomly initialized first order methods can always achieve zero training loss. A widely believed explanation on why a neural network can fit all training labels is because the neural network is over-parameterized. For example, Wide ResNet [Zagoruyko and Komodakis] uses 100x parameters than the number of training data and thus there must exist one such neural network of this architecture that can fit all training data. However, the existence does not imply why the network found by a randomly initialized first order method can fit all the data.

artificial intelligence, machine learning, neural network, (14 more...)

arXiv.org Machine Learning

Oct-4-2018

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks (1.00)
  - Statistical Learning > Gradient Descent (0.95)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found