Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer
