Transformer Language Models Handle Word Frequency in Prediction Head
