Global Convergence of Gradient Descent for Deep Linear Residual Networks

Nov-2-2019–arXiv.org Machine Learning

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization: zero-asymmetric (ZAS) initialization. It is motivated by avoiding stable manifolds of saddle points. We prove that under the ZAS initialization, for an arbitrary target matrix, gradient descent converges to an $\varepsilon$-optimal point in $O(L^3 \log(1/\varepsilon))$ iterations, which scales polynomially with the network depth $L$. Our result and the $\exp(\Omega(L))$ convergence time for the standard initialization (Xavier or near-identity) [Shamir, 2018] together demonstrate the importance of the residual structure and the initialization in the optimization for deep linear neural networks, especially when $L$ is large.

gradient descent, initialization, matrix, (15 more...)

arXiv.org Machine Learning

Nov-2-2019

arXiv.org PDF

Add feedback

Country:
- North America
  - Canada (0.04)
  - United States > New Jersey
    - Mercer County > Princeton (0.04)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks (1.00)
  - Statistical Learning > Gradient Descent (0.96)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found