Why are Adaptive Methods Good for Attention Models?
