On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models