Using Curvature Information for Fast Stochastic Search

Neural Information Processing Systems 

We present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks.

Learning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (on-line) or batch form. Stochastic learning provides several advantages over batch learning.
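To make the distinction between the two formulations concrete, the following is a minimal sketch on a noise-free linear least-squares problem. It shows plain gradient descent only and is not the adaptive momentum algorithm presented in this paper. The function names (`batch_step`, `stochastic_step`, `make_data`), the learning rates, and the synthetic data are all assumptions chosen for illustration.

```python
import random

def batch_step(w, data, lr):
    """One batch update: average the squared-error gradient over all examples."""
    n = len(data)
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / n
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def stochastic_step(w, example, lr):
    """One stochastic (on-line) update: use the gradient of a single example."""
    x, y = example
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def make_data(n, true_w, rng):
    """Hypothetical noise-free linear data: y = true_w . x."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        y = sum(t * xi for t, xi in zip(true_w, x))
        data.append((x, y))
    return data

rng = random.Random(0)
true_w = [2.0, -1.0]
data = make_data(200, true_w, rng)

# Batch form: every update touches the whole training set.
w_batch = [0.0, 0.0]
for _ in range(200):
    w_batch = batch_step(w_batch, data, lr=0.5)

# Stochastic form: every update touches one example.
w_sgd = [0.0, 0.0]
for _ in range(5):
    for example in data:
        w_sgd = stochastic_step(w_sgd, example, lr=0.1)
```

In this sketch the batch loop makes 200 weight updates at a cost of 200 examples each, while the stochastic loop makes 1000 updates at a cost of one example each. Both cost O(n) per example in the number of weights.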