On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent
