Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?
