H-Fac: Memory-Efficient Optimization with Factorized Hamiltonian Descent

Son Nguyen, Lizhang Chen, Bo Liu, Qiang Liu

arXiv.org Artificial Intelligence 

Optimization algorithms play an indisputable role in the remarkable development of AI, especially in modern deep learning. In recent years, breakthroughs in architectural innovation [3] and practical applications [37] have further increased the need for efficient training paradigms, including optimization algorithms that strike a balance between performance and manageable memory costs. Stochastic gradient descent (SGD) is widely regarded as the standard algorithm for training deep learning models and is supported by extensive theoretical foundations [31, 32, 34, 43]. However, it requires careful hyperparameter tuning and often converges slowly on many contemporary architectures [10, 36, 40]. Adaptive gradient methods such as Adam [17], AdaGrad [12], and AMSGrad [29] address this by adjusting the learning rate of each parameter throughout optimization using accumulated second-order (second-moment) gradient statistics.
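For context, the sketch below shows the standard per-parameter adaptive update used by Adam [17], which maintains running first- and second-moment estimates of the gradient. This is an illustrative NumPy rendering of the well-known formulation, not the paper's H-Fac method; the variable names follow common convention.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes from running moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1**t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Example: one step on a toy parameter vector (t counts steps starting at 1).
p = np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
m, v = np.zeros_like(p), np.zeros_like(p)
p, m, v = adam_step(p, g, m, v, t=1)
```

Note that the moment buffers m and v each match the parameter shape, so Adam carries roughly twice the optimizer memory of plain SGD. This overhead is the memory cost that factorized approaches such as the H-Fac method in the title aim to reduce.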
