Learning and Transferring Sparse Contextual Bigrams with Linear Transformers