Scaling Data-Constrained Language Models

Neural Information Processing Systems

We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
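
To make "decreasing value of repeated tokens" concrete, the following is a minimal sketch that assumes an exponentially saturating form: each additional pass over the same unique data adds less effective data than the pass before it. The function name `effective_tokens`, the saturating form itself, and the default constant `r_star` are assumptions chosen for illustration; none of them is stated in the summary above.

```python
import math


def effective_tokens(unique_tokens: float, repetitions: float, r_star: float = 15.0) -> float:
    """Effective data under an ASSUMED saturating model of repetition.

    unique_tokens: number of distinct training tokens available.
    repetitions:   number of extra passes beyond the first epoch (0 = single epoch).
    r_star:        hypothetical constant controlling how quickly repeats lose value.
    """
    if unique_tokens < 0 or repetitions < 0 or r_star <= 0:
        raise ValueError("unique_tokens and repetitions must be >= 0, r_star must be > 0")
    # The first epoch counts in full; repeated epochs contribute with decaying value.
    repeated_value = r_star * (1.0 - math.exp(-repetitions / r_star))
    return unique_tokens * (1.0 + repeated_value)


if __name__ == "__main__":
    unique = 100e9  # hypothetical budget of 100B unique tokens
    for reps in (0, 1, 4, 16, 64):
        seen = unique * (1 + reps)  # tokens actually processed during training
        eff = effective_tokens(unique, reps)
        print(f"repetitions={reps:>2}  seen={seen:.3e}  effective={eff:.3e}")
```

Under this assumed form, a few repetitions are worth almost as much as fresh data, while many repetitions approach a ceiling of `unique_tokens * (1 + r_star)`. An analogous saturating term could be written for excess parameters.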