$\bar{G}_{mst}$: An Unbiased Stratified Statistic and a Fast Gradient Optimization Algorithm Based on It

Chen, Aixiang

arXiv.org Machine Learning 

Optimizing a very large model with deeper and wider layers is difficult. Like most optimization algorithms, gradient methods (SGD-like algorithms) for training deep models have drawbacks: they easily get trapped in local minima or at saddle points, and they converge slowly. Much research has sought to improve the gradient method, and a considerable part of it focuses on refining the search direction, while keeping the per-iteration cost as low as possible, in order to accelerate convergence [10, 11, 12, 13, 14, 15, 16]. These improvements to the search direction fall roughly into two categories. One is the momentum method [11], which is based on principles of physics, together with its improved variants [12, 20, 21]. The momentum method retains part of the motion of the previous search trajectory, which keeps the search path from oscillating too widely and thereby accelerates convergence.
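To make the momentum idea concrete, the following is a minimal sketch of the classical heavy-ball update applied to a small, ill-conditioned quadratic. It is not the algorithm proposed in this paper. The function names `sgd_momentum` and `quad_grad`, the test objective, and the values of `lr` and `beta` are all illustrative assumptions.

```python
def sgd_momentum(grad, x0, lr=0.05, beta=0.9, steps=300):
    """Heavy-ball momentum (illustrative sketch, not the paper's method).

    Update rule: v <- beta * v + grad(x);  x <- x - lr * v
    """
    x = list(x0)
    v = [0.0] * len(x)  # velocity starts at zero
    for _ in range(steps):
        g = grad(x)
        # Keep a fraction beta of the previous velocity, then add the new gradient.
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        # Step along the accumulated velocity rather than the raw gradient.
        x = [xi - lr * vi for xi, vi in zip(x, v)]
    return x


def quad_grad(x):
    # Gradient of the assumed test objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2),
    # whose curvature differs by a factor of 10 between the two coordinates.
    return [x[0], 10.0 * x[1]]


x_final = sgd_momentum(quad_grad, [5.0, 5.0])
print(x_final)
```

With `beta = 0` the loop reduces to plain gradient descent, so the retained velocity term is the only difference between the two methods in this sketch.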