On the Variance of the Adaptive Learning Rate and Beyond