On the Convergence of Gradient Descent on Learning Transformers with Residual Connections
