Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Open in new window