Review for NeurIPS paper: Why are Adaptive Methods Good for Attention Models?