Review for NeurIPS paper: Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge


The paper proposes a new algorithm for federated learning with resource-constrained edge devices. The algorithm adapts distillation-based techniques (usually used to compress a larger model into a smaller one) into a two-way knowledge-transfer scheme that jointly trains small local networks on the edge devices and a larger global network on the server/cloud. Methodologically the paper is novel, useful, and well written, but a few points raised by the reviewers are very pertinent and need to be discussed in the final version:

1. One key advantage and motivation for the method is stated to be reduced communication, but this has not been empirically justified against FedAvg. The method has the potential for less frequent communication than FedAvg, yet this has not been validated empirically; it would be good to report this information for the experiments presented.

2. Exchanging features rather than parameters is stated as an advantage, but I agree with R1 and R3's concern that this may not hold for now-standard networks on high-resolution images, where the per-iteration communication scales as #samples x #hidden units (or features), which can be large.
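To make the second concern concrete, here is a rough back-of-envelope comparison; all sizes (parameter count, sample count, feature dimension) are hypothetical illustrations, not numbers taken from the paper under review:

```python
# Back-of-envelope communication estimate (all sizes are hypothetical,
# not taken from the paper under review).

def fedavg_bytes_per_round(num_params: int, bytes_per_value: int = 4) -> int:
    """FedAvg: each client uploads and downloads the full model once per round."""
    return 2 * num_params * bytes_per_value

def feature_exchange_bytes_per_round(num_samples: int, feature_dim: int,
                                     bytes_per_value: int = 4) -> int:
    """Feature exchange: each client sends one feature vector per local sample
    (return traffic such as soft labels is ignored for simplicity)."""
    return num_samples * feature_dim * bytes_per_value

# Hypothetical edge model with ~1M parameters vs. 50k local high-resolution
# images whose extracted feature maps have 7*7*512 values each.
print(fedavg_bytes_per_round(num_params=1_000_000))               # ~8 MB per round
print(feature_exchange_bytes_per_round(num_samples=50_000,
                                       feature_dim=7 * 7 * 512))  # ~5 GB per round
```

Under assumptions of this kind, per-round feature traffic can exceed parameter traffic by orders of magnitude, which is why empirical communication measurements against FedAvg would strengthen the paper.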