Review for NeurIPS paper: Distributed Distillation for On-Device Learning


Additional Feedback:
- The empirical results are not very convincing: the performance of distributed distillation is significantly worse than that of plain distributed SGD. The amount of communication required is substantially smaller, but comparable communication savings have been achieved by federated averaging with C = 1 [3] or by dynamic averaging [4], with (seemingly) far better model performance (albeit on a fully connected network graph). I suggest comparing against those baselines on a fully connected network, and against decentralized learning approaches [5,6] on a non-fully-connected network. The authors might argue that distributed distillation has an advantage over federated averaging for non-convex problems: in federated averaging, averaging two models that lie in different local minima can yield a model that is far worse than either of the two local models (a toy sketch of this effect is given below).
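
To make the non-convexity concern concrete, here is a minimal, purely illustrative sketch (not taken from the paper or from [3]; the function names and the toy loss are my own) of a FedAvg-style aggregation step with C = 1, where averaging two client models that sit in different minima produces a worse model than either client model:

```python
import numpy as np

def fedavg_round(client_weights, client_sizes):
    """One FedAvg-style aggregation step: replace the server model with the
    data-size-weighted average of all client models (C = 1, i.e. every
    client participates in the round). Hypothetical helper for illustration."""
    total = sum(client_sizes)
    return sum((n / total) * w for w, n in zip(client_weights, client_sizes))

# Toy non-convex loss with two minima at w = +2 and w = -2 (assumed example).
loss = lambda w: np.minimum((w - 2.0) ** 2, (w + 2.0) ** 2) + 1.0

w_a, w_b = np.array([2.0]), np.array([-2.0])      # each client is at a minimum
w_avg = fedavg_round([w_a, w_b], client_sizes=[1, 1])

print(loss(w_a), loss(w_b))   # both local models: loss 1.0
print(loss(w_avg))            # averaged model lands between the minima: loss 5.0
```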