Reviews: L4: Practical loss-based stepsize adaptation for deep learning

Neural Information Processing Systems 

The paper proposes a scheme for adaptive choice of learning rate for stochastic gradients descent and its variants. The key idea is very simple and easy to implement: given the loss value L at the global minimum, L_min, the idea is to choose learning rate eta, such that the update along the gradient reaches L_min from the current point i.e. solving L(theta-eta*v) L_min in eta, where v is for example dL/dtheta in gradient descent. Finally, to make the adaptive learning rate pessimistic to the possible linearization error, the authors introduce a coefficient alpha, so the effective learning rate used by the optimizer is eta*alpha. The authors empirically show (on badly conditioned regression, MNIST, CIFAR-10, and neural computer) that using such adaptive scheme helps in two ways: 1. the optimization performance is less sensitive to the choice of the coefficient alpha vs the learning rate (in non-adaptive setting), and 2. the optimizer can reduce the loss faster or at worst in equal speed with commonly used optimizers. At the same time, the paper has some shortcomings as admitted by the authors: 1.