Can Transformers Learn $n$-gram Language Models?