General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization

Kwangjun Ahn, Gagik Magakyan, Ashok Cutkosky

arXiv.org Machine Learning 

Training large-scale neural network models, such as large language models, requires a well-designed optimization strategy to ensure stable and fast convergence. For instance, training typically relies on a carefully designed optimizer, such as Adam [Kingma and Ba, 2014], together with a meticulously tuned learning rate schedule. Recently, Defazio et al. [2024] introduced the schedule-free method, which achieves impressive training performance without any learning rate scheduling. In brief, the schedule-free method is an add-on scheme that can be applied to any chosen base optimizer, converting it into a schedule-free variant. While this method has shown strong empirical performance in training large neural network models, its theoretical analysis has, to date, been limited to the convex setting [Defazio et al., 2024]. Our aim is to extend the theoretical understanding of schedule-free methods to nonconvex optimization. As an initial step, we focus in this work on the variant whose base optimizer is SGD, referred to as schedule-free SGD.
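For concreteness, the sketch below illustrates the schedule-free SGD recursion as we understand it from Defazio et al. [2024]: gradients are queried at an interpolation y_t between the base SGD iterate z_t and a running average x_t, while the step size stays constant. The function name, the toy quadratic objective, and the hyperparameter values are illustrative placeholders, not settings taken from the paper.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, num_steps=1000):
    """Minimal sketch of schedule-free SGD (after Defazio et al., 2024).

    Three coupled sequences are maintained:
      z_t : the underlying SGD iterate, updated with a *constant* step size,
      x_t : a running average of the z iterates (the point that is returned),
      y_t : the interpolation (1 - beta) * z_t + beta * x_t at which the
            gradient is queried.
    No learning-rate schedule appears anywhere; `lr` is fixed throughout.
    """
    z = np.asarray(x0, dtype=float).copy()  # base SGD iterate z_1
    x = z.copy()                            # averaged iterate x_1 = z_1
    for t in range(1, num_steps + 1):
        y = (1.0 - beta) * z + beta * x     # gradient query point y_t
        z = z - lr * grad_fn(y)             # constant-step SGD update -> z_{t+1}
        c = 1.0 / (t + 1)                   # c_{t+1} = 1/(t+1): x stays a uniform average
        x = (1.0 - c) * x + c * z           # x_{t+1} averages z_1, ..., z_{t+1}
    return x

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
if __name__ == "__main__":
    w = schedule_free_sgd(lambda v: v, x0=np.ones(5), lr=0.1, num_steps=500)
    print(w)  # entries close to zero
```

In the stochastic setting, `grad_fn` would return a minibatch gradient; the salient point is that the z-update uses a constant step size, with the averaging that produces x playing the role usually filled by explicit learning rate decay.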