Distillation Scaling Laws
Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute-optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining up to a compute level that grows predictably with student size. If only one student is to be distilled and a teacher also needs training, supervised learning should be used instead. Additionally, we provide insights from our large-scale study of distillation, which increase our understanding of distillation and inform experimental design.
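To make the allocation problem concrete, below is a minimal sketch of how one might search over teacher/student compute splits under a fixed total budget. The functional forms, the `distilled_student_loss` coupling, and all constants are illustrative assumptions (Chinchilla-style stand-ins), not the fitted law from the paper.

```python
# Hypothetical sketch: choosing a teacher/student compute split under a
# fixed total FLOP budget. Forms and constants are illustrative only.
import numpy as np

FLOPS_PER_PARAM_TOKEN = 6.0  # common C ~= 6 * N * D approximation


def supervised_loss(n_params, n_tokens, A=406.4, B=410.7, E=1.69,
                    alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss (Hoffmann et al. constants),
    used purely as a stand-in for a fitted teacher law."""
    return E + A / n_params**alpha + B / n_tokens**beta


def distilled_student_loss(teacher_loss, n_student, d_student):
    """Hypothetical distillation law: student loss improves with teacher
    quality and with student size/data. Made-up coupling, not the paper's."""
    capacity_term = 406.4 / n_student**0.34 + 410.7 / d_student**0.28
    return teacher_loss + 0.5 * capacity_term


def best_split(total_flops, n_student, n_teacher_grid):
    """Grid-search the teacher compute share and teacher size; the student
    is trained on the remaining budget. Returns the best student loss."""
    best = (np.inf, None)
    for frac in np.linspace(0.1, 0.9, 17):  # teacher's compute share
        c_teacher = frac * total_flops
        c_student = total_flops - c_teacher
        d_student = c_student / (FLOPS_PER_PARAM_TOKEN * n_student)
        for n_teacher in n_teacher_grid:
            d_teacher = c_teacher / (FLOPS_PER_PARAM_TOKEN * n_teacher)
            l_teacher = supervised_loss(n_teacher, d_teacher)
            l_student = distilled_student_loss(l_teacher, n_student, d_student)
            if l_student < best[0]:
                best = (l_student, (frac, n_teacher))
    return best


# Example: 1e21 FLOPs total, a 1B-parameter student, candidate teacher sizes.
loss, (teacher_frac, n_teacher) = best_split(
    total_flops=1e21, n_student=1e9,
    n_teacher_grid=[2e9, 5e9, 1e10, 2e10])
```

Under these assumed forms, the search trades off teacher quality (more teacher compute lowers the teacher loss) against student training data (more student compute lowers the capacity term); the paper's fitted law resolves this trade-off quantitatively.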
Feb-12-2025