Transformers Can Represent $n$-gram Language Models
