We sincerely thank the reviewers for their valuable feedback, for pointing out weaknesses in our work, and for suggesting presentation improvements.

Neural Information Processing Systems 

R1 - Is this distillation only on the training set, or is there data augmentation? The model is trained solely on the training set. We follow the vanilla setup described in Hinton et al. [2014].

R1 - Can the authors comment how movement pruning might work for generative tasks?

R2 - As with most work on pruning, it is not yet possible to realize efficiency gains on GPU.