Transcending Scaling Laws with 0.1% Extra Compute