Reviews: A Multi-Batch L-BFGS Method for Machine Learning

Neural Information Processing Systems 

In supervised learning, one is interested in minimizing the empirical risk where efficient optimization algorithms become the key. First-order methods such as stochastic gradient descent and its variants are reasonably well understood admitting efficient implementation and parallelization techniques. However there has been a recent interest in making second-order methods such as Newton's method or L-BFGS method efficient for such large-scale problems. This paper is along this direction, presenting a new variant of the stochastic L-BFGS method that is efficient and robust in mainly two settings: The first arises in the presence of node failures in a distributed computing environment, the second occurs when one uses an adaptive batch size that varies over iterations for accelerating learning. The main idea is to form the Hessian estimate based on the overlap between consecutive batches (the intuition why this works is that we have less limitation in choosing the second-order information matrix compared to an estimate of the true gradient).