The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
