Gradient Descent: Second-Order Momentum and Saturating Error

Neural Information Processing Systems 

Batch gradient descent, Δw(t) = −η ∂E/∂w(t), converges to a minimum of quadratic form with a time constant no better than ¼ λ_max/λ_min, where λ_min and λ_max are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = −η ∂E/∂w(t) + α Δw(t−1), improves this to ¼ √(λ_max/λ_min), although only in the batch case. Here we show that second-order momentum, Δw(t) = −η ∂E/∂w(t) + α Δw(t−1) + β Δw(t−2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a non-quadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
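The update rules above can be sketched as follows. This is a minimal illustration, not the paper's experimental setup: the Hessian H, the step size eta, and the momentum coefficients alpha and beta are arbitrary illustrative choices, and the quadratic E(w) = ½ wᵀHw stands in for the error surface.

```python
import numpy as np

# Quadratic error surface E(w) = 1/2 w^T H w with eigenvalues
# lambda_min = 1 and lambda_max = 25 (illustrative values).
H = np.diag([1.0, 25.0])

def descend(eta, alpha=0.0, beta=0.0, steps=200):
    """Batch gradient descent with optional first-order (alpha) and
    second-order (beta) momentum:
        dw(t) = -eta * dE/dw(t) + alpha * dw(t-1) + beta * dw(t-2)
    Returns the final error E(w)."""
    w = np.array([1.0, 1.0])
    dw_prev = np.zeros(2)   # dw(t-1)
    dw_prev2 = np.zeros(2)  # dw(t-2)
    for _ in range(steps):
        grad = H @ w        # dE/dw for the quadratic form
        dw = -eta * grad + alpha * dw_prev + beta * dw_prev2
        w = w + dw
        dw_prev2, dw_prev = dw_prev, dw
    return 0.5 * w @ H @ w

# Plain descent vs. first-order momentum; eta and alpha here are
# hand-picked stable settings, not optimal ones.
plain = descend(eta=0.07)
momentum = descend(eta=0.07, alpha=0.5)
```

With a condition number λ_max/λ_min = 25, the slow mode of plain descent decays like (1 − ηλ_min)^t, and the momentum run reaches a lower error in the same number of steps, consistent with the improved time constant quoted in the abstract.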