H-Fac: Memory-Efficient Optimization with Factorized Hamiltonian Descent
Son Nguyen, Lizhang Chen, Bo Liu, Qiang Liu
arXiv.org Artificial Intelligence
Optimization algorithms play an indispensable role in the remarkable development of AI, especially in modern deep learning. In recent years, breakthroughs in architectural innovation [3] and practical applications [37] have further increased the need for efficient training paradigms: optimization algorithms that strike a balance between performance and manageable memory cost. Stochastic gradient descent (SGD) is widely regarded as the standard algorithm for training deep learning models, supported by extensive theoretical foundations [31, 32, 34, 43]. However, it requires careful hyperparameter tuning and often exhibits slow convergence on many contemporary architectures [10, 36, 40]. Meanwhile, adaptive gradient methods such as Adam [17], AdaGrad [12], and AMSGrad [29] adjust the learning rate of each parameter throughout optimization by accumulating second-order gradient statistics.
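To make the memory trade-off concrete, the sketch below shows a minimal Adam-style update in NumPy (not the paper's H-Fac method): each parameter keeps running first- and second-moment estimates of its gradient, so the optimizer state is twice the size of the model itself. The function name `adam_step` and the toy quadratic objective are illustrative choices, not from the source.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter learning rates derived from
    running first- and second-moment estimates of the gradient."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)               # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2 starting from theta = 2.0.
theta = np.array([2.0])
m = np.zeros_like(theta)  # extra state: same shape as the parameters
v = np.zeros_like(theta)  # extra state: same shape as the parameters
for t in range(1, 2001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
```

The two auxiliary buffers `m` and `v` are exactly the per-parameter statistics whose memory footprint memory-efficient optimizers such as H-Fac aim to reduce, e.g. by factorizing them into low-rank components.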
Jun-17-2024