On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models