Can Transformers Learn $n$-gram Language Models?
