On the Optimization and Generalization of Multi-head Attention
