Learning and Transferring Sparse Contextual Bigrams with Linear Transformers

Open in new window