Reviews: On the Convergence Rate of Training Recurrent Neural Networks

Feb-11-2025, 21:03:48 GMT – Neural Information Processing Systems

This paper shows that GD/SGD can minimize the training loss of RNNs at a linear convergence rate, assuming the hidden layer width is sufficiently large (polynomial in the data size and the time horizon length). To prove this, the authors show that, within a small region around the initialization, the squared norm of the gradient is lower bounded by the function value (Theorem 3). The authors further show that the loss function is somewhat smooth (Theorem 4), which guarantees that moving in the negative gradient direction decreases the function value. The paper builds new techniques for analyzing multi-layer ReLU networks. It also shows that, with appropriate initialization, ReLU activations avoid both exponential explosion and exponential vanishing.
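
To illustrate why a gradient lower bound combined with smoothness gives a linear rate, here is a simplified sketch of the standard argument. It is not the paper's proof: the symbols f, W_t, \eta, \mu and L are assumptions introduced here, and ordinary L-smoothness stands in for the weaker smoothness property of Theorem 4. The sketch also assumes the iterates stay inside the region around the initialization where the gradient bound holds.

```latex
% Assumed gradient lower bound (in the spirit of Theorem 3), for some \mu > 0:
\[
  \|\nabla f(W)\|^2 \;\ge\; \mu \, f(W).
\]
% Assumed standard L-smoothness (a stand-in for Theorem 4), applied to one
% gradient step W_{t+1} = W_t - \eta \nabla f(W_t):
\[
  f(W_{t+1}) \;\le\; f(W_t) - \eta \,\|\nabla f(W_t)\|^2
             + \frac{L \eta^2}{2} \,\|\nabla f(W_t)\|^2 .
\]
% With step size \eta \le 1/L we have 1 - L\eta/2 \ge 1/2, so
\[
  f(W_{t+1}) \;\le\; f(W_t) - \frac{\eta}{2} \,\|\nabla f(W_t)\|^2
             \;\le\; \Bigl(1 - \frac{\eta \mu}{2}\Bigr) f(W_t),
\]
% and unrolling over t steps gives a geometric (linear-rate) decrease:
\[
  f(W_t) \;\le\; \Bigl(1 - \frac{\eta \mu}{2}\Bigr)^{t} f(W_0).
\]
```

Under these assumptions the loss shrinks by a constant factor at every step, which is what a linear convergence rate means here.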

initialization, step size, training recurrent neural network, (10 more...)

