L-BFGS and neural nets • /r/MachineLearning
I've been doing a little reading on optimization (from Nocedal's book) and have some questions about the prevalence of SGD and variants such as Adam for training neural nets. L-BFGS and other quasi-Newton methods converge faster than plain gradient descent, both in theory and in published experiments. Are there any good reasons why training with L-BFGS is much less popular (or at least less talked about) than SGD and its variants? For the deep learning practitioners: have you ever tried L-BFGS, other quasi-Newton methods, or conjugate gradient methods? In a similar vein, has anyone experimented with doing a line search for the optimal step size at each gradient descent step?
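To make the last question concrete, here is a minimal sketch of what I mean by a line search at each step: plain gradient descent where the step size is chosen by backtracking until the Armijo sufficient-decrease condition holds. Everything in it is an assumption for illustration only: the toy quadratic objective, the function names (`backtracking_step`, `gd_line_search`), and the constants (`t0`, `shrink`, `c`) are mine and don't come from any library or paper.

```python
# Gradient descent with a backtracking (Armijo) line search.
# Illustrative sketch: objective, names, and constants are assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def backtracking_step(f, x, g, t0=1.0, shrink=0.5, c=1e-4, max_shrinks=50):
    """Shrink t until f(x - t*g) <= f(x) - c * t * ||g||^2 (Armijo condition)."""
    fx = f(x)
    gg = dot(g, g)
    t = t0
    for _ in range(max_shrinks):
        x_new = [xi - t * gi for xi, gi in zip(x, g)]
        if f(x_new) <= fx - c * t * gg:
            break
        t *= shrink
    return t

def gd_line_search(f, grad, x0, iters=500, tol=1e-10):
    """Full-batch gradient descent; each step size comes from the line search."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        if dot(g, g) < tol:  # stop once the squared gradient norm is tiny
            break
        t = backtracking_step(f, x, g)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x

# Toy ill-conditioned quadratic: f(x) = 0.5 * (x0^2 + 10 * x1^2), minimum at the origin.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad(x):
    return [x[0], 10.0 * x[1]]

x_min = gd_line_search(f, grad, [3.0, -2.0])
```

Note this is the deterministic, full-batch case, where `f` and `grad` are exact. With minibatches both would be noisy estimates, so the sufficient-decrease test would only hold for the batch it was evaluated on, which is part of what I'm curious about.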