Appendix A Comparisons with Existing NAS and KD Methods


A.1 Training Cost Comparisons with Traditional Task-agnostic KD Methods

Different hardware platforms (e.g., FPGA, CPU, GPU) impose different resource constraints. AutoDistil uses NAS to generate a gallery of fully trained compressed student models spanning variable resource constraints (e.g., FLOPs, parameters). Given a resource constraint, one can simply pick a model from the trained pool and fine-tune it on the downstream task. In contrast, traditional task-agnostic knowledge distillation (KD) methods (e.g., MiniLM) target a specific compression rate and need to be trained repeatedly for different student configurations (corresponding to different resource constraints). Therefore, AutoDistil incurs a much lower amortized computation cost than traditional task-agnostic KD methods.
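
To make the amortized-cost argument concrete, the following minimal sketch compares the two regimes. All cost figures (supernet_hours, kd_run_hours, finetune_hours) are hypothetical placeholders introduced here for illustration; they are not values reported for AutoDistil or MiniLM.

```python
# Hypothetical illustration of amortized training cost per student model.
# Numbers are placeholders, not measurements from the paper.

def amortized_cost_nas(cost_supernet: float, cost_finetune: float, num_students: int) -> float:
    """AutoDistil-style: one super-network training run is shared by all
    extracted students; each student then only pays a fine-tuning cost."""
    return cost_supernet / num_students + cost_finetune

def amortized_cost_kd(cost_kd_run: float, cost_finetune: float) -> float:
    """Traditional task-agnostic KD: every student configuration requires
    its own full distillation run before fine-tuning."""
    return cost_kd_run + cost_finetune

if __name__ == "__main__":
    # Assumed (hypothetical) GPU-hour costs.
    supernet_hours, kd_run_hours, finetune_hours = 300.0, 100.0, 5.0
    for k in (1, 4, 16):
        nas = amortized_cost_nas(supernet_hours, finetune_hours, k)
        kd = amortized_cost_kd(kd_run_hours, finetune_hours)
        print(f"{k:>2} students: NAS {nas:6.1f} GPU-h/student vs KD {kd:6.1f} GPU-h/student")
```

The one-time super-network training cost is divided across all students served from the pool, so the per-student cost of the NAS approach shrinks as more resource constraints are covered, whereas the per-student cost of repeated task-agnostic KD stays constant.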