On the Optimization and Generalization of Multi-head Attention