On the Convergence Rate of Training Recurrent Neural Networks

Allen-Zhu, Zeyuan, Li, Yuanzhi, Song, Zhao

Mar-18-2020, 23:16:34 GMT–Neural Information Processing Systems

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima when training multi-layer neural networks? Why can they fit random labels even with non-convex and non-smooth architectures? Most existing theory covers only networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks because the same recurrent unit is applied repeatedly across the entire time horizon of length $L$, which is analogous to a feedforward network of depth $L$.
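To make the depth analogy concrete, here is a minimal sketch of unrolling a recurrent unit over a horizon of length $L$. It assumes an Elman-style update h_l = relu(W h_{l-1} + A x_l); the abstract does not specify the unit, and the names `rnn_forward`, `W`, `A`, and the sizes `m`, `d`, `L` are illustrative choices rather than anything taken from the paper.

```python
import random

def relu(v):
    # Elementwise ReLU, the non-smooth activation assumed for this sketch.
    return [max(0.0, x) for x in v]

def matvec(M, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def random_matrix(rows, cols, scale, rng):
    # Gaussian random matrix with the given standard deviation.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def rnn_forward(W, A, xs):
    """Unroll h_l = relu(W h_{l-1} + A x_l) over xs and return every hidden state."""
    h = [0.0] * len(W)  # h_0 = 0
    states = []
    for x in xs:  # one loop iteration per time step
        pre = [a + b for a, b in zip(matvec(W, h), matvec(A, x))]
        h = relu(pre)  # the same W and A are reused at every step
        states.append(h)
    return states

rng = random.Random(0)
m, d, L = 8, 3, 5  # hidden width, input dimension, time horizon (illustrative)
W = random_matrix(m, m, (2.0 / m) ** 0.5, rng)
A = random_matrix(m, d, (2.0 / m) ** 0.5, rng)
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
states = rnn_forward(W, A, xs)
print(len(states), len(states[0]))
```

The loop body plays the role of one layer, so the unrolled computation resembles a depth-$L$ feedforward network, except that every "layer" shares the same `W` and `A` instead of having its own weights.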

convergence rate, multi-layer network, training recurrent neural network


Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks > Deep Learning (0.65)
  - Statistical Learning > Gradient Descent (0.63)