Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia
Xie, Zeke, Wang, Xinrui, Zhang, Huishuai, Sato, Issei, Sugiyama, Masashi
Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is the most popular stochastic optimizer for accelerating the training of deep neural networks. However, Adam often generalizes significantly worse than Stochastic Gradient Descent (SGD). It remains mathematically unclear how Adaptive Learning Rate and Momentum affect saddle-point escaping and minima selection. Based on the diffusion theoretical framework, we decouple the effects of Adaptive Learning Rate and Momentum on saddle-point escaping and minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect that helps the optimizer pass through saddle points, and barely affects flat minima selection. This mathematically explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. We design a novel adaptive optimizer named Adaptive Inertia Estimation (Adai), which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD. Our real-world experiments demonstrate that Adai significantly outperforms SGD and existing Adam variants.
Oct-28-2020
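To make the "parameter-wise adaptive inertia" idea concrete, below is a minimal sketch of what one step of an Adai-style update could look like. It is an illustrative simplification under assumed hyperparameter names (beta0, beta2, eps) and does not reproduce the paper's exact update rule or bias corrections; the key point it shows is that the momentum coefficient, not the learning rate, is adapted per parameter, with higher inertia on coordinates whose gradient second moment is relatively small.

```python
import numpy as np

def adai_style_step(theta, m, v, grad, lr=1e-3, beta0=0.1, beta2=0.999, eps=1e-3):
    """One illustrative adaptive-inertia step (sketch, not the paper's exact rule)."""
    # Exponential moving average of squared gradients (second moment estimate).
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Parameter-wise momentum coefficient ("inertia"): coordinates with a second
    # moment below the mean get beta1 close to 1, noisier coordinates get less inertia.
    beta1 = np.clip(1.0 - beta0 * v / (np.mean(v) + 1e-16), 0.0, 1.0 - eps)
    # Momentum (first moment) with the parameter-wise coefficient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Plain momentum-style update: the learning rate itself is NOT adapted.
    theta = theta - lr * m
    return theta, m, v
```

In contrast to Adam, the step size stays uniform across parameters here; only the inertia adapts, which is the mechanism the abstract credits for accelerating training while still favoring flat minima as SGD does.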