Why are Adaptive Methods Good for Attention Models?