Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
