The Unified Balance Theory of Second-Moment Exponential Scaling Optimizers in Visual Tasks
Existing first-order optimizers fall mainly into two branches: classical optimizers, represented by Stochastic Gradient Descent (SGD), and adaptive optimizers, represented by Adam, along with their many derivatives. The debate over the merits and demerits of these two families has persisted for a decade. In practice, SGD is generally considered more suitable for tasks such as Computer Vision (CV), while adaptive optimizers are widely used in tasks with sparse gradients, such as training Large Language Models (LLMs). Although adaptive optimizers generally offer faster convergence, they can lead to overfitting in some cases, resulting in poorer generalization than SGD on certain tasks. Even in Large Language Models, Adam continues to face challenges, and its original strategy does not always retain an advantage once improvements such as gradient clipping are introduced. With such a wide variety of optimization methods available, a unified, interpretable theory is needed. This paper works within the framework of first-order optimizers and, by introducing the balance theory, proposes for the first time a unified strategy that integrates all first-order optimization methods.
arXiv.org Artificial Intelligence
May-28-2024
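For reference, a minimal sketch of the two update rules being contrasted, written in their standard textbook form (these are the conventional SGD and Adam updates with the usual hyperparameters $\eta$, $\beta_1$, $\beta_2$, $\epsilon$; they are not the unified strategy the paper proposes):

\begin{aligned}
\text{SGD:}\quad & \theta_{t+1} = \theta_t - \eta\, g_t \\
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \\
& \hat m_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^{t}}, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
\end{aligned}

The exponentially weighted second moment $v_t$, which rescales each step, is the "second-moment exponential scaling" referred to in the title, and whether such rescaling helps or hurts generalization is precisely the SGD-versus-adaptive debate summarized above.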