Second Order Optimization Made Practical
Anil, Rohan, Gupta, Vineet, Koren, Tomer, Regan, Kevin, Singer, Yoram
Second-order gradient methods are among the most powerful algorithms in mathematical optimization. Algorithms in this family use a preconditioner matrix to transform the gradient before applying each step. Classically, this involves computing or approximating the matrix of second-order derivatives, i.e, the Hessian, in the context of exact deterministic optimization (e.g., Fletcher, 2013; Lewis & Overton, 2013; Nocedal, 1980). In contrast, AdaGrad (Duchi et al., 2011) and related algorithms that target stochastic optimization use the covariance matrix of second-order gradient statistics to form the preconditioner. While second-order methods often have significantly better convergence properties than first-order methods, the size of typical problems prohibits their use in practice, as they require quadratic storage and cubic computation time for each gradient update. Thus, these methods not commonly seen in the present practice of optimization in machine learning, which is largely dominated by the simpler to implement first-order methods. Arguably, one of the greatest challenges of modern optimization is to bridge this gap between the theoretical and practical optimization and make second-order optimization more feasible to implement and deploy. In this paper, we attempt to contribute towards narrowing this gap between theory and practice, focusing on second-order adaptive methods. These methods can be thought of as full-matrix analogues of common adaptive algorithms of the family of AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014).
Feb-20-2020
- Country:
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Genre:
- Research Report (1.00)
- Technology: