Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
