AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
