Toward Understanding Why Adam Converges Faster Than SGD for Transformers