Distributed Distillation for On-Device Learning

Neural Information Processing Systems 

Transmitting model weights incurs substantial communication overhead and restricts participation to devices that share an identical model architecture.
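
To make the overhead gap concrete, here is a hypothetical back-of-the-envelope comparison; the model size, sample count, and class count are illustrative assumptions, not figures from the paper. Distillation-style schemes that exchange soft predictions on a shared reference set rather than weight vectors can cut the per-round payload by orders of magnitude, and the payload size no longer depends on the model architecture.

```python
# Back-of-the-envelope payload comparison: sharing weights vs. sharing
# soft predictions. All quantities below are illustrative assumptions.

BYTES_PER_FLOAT32 = 4

# Weight exchange: payload grows with model size.
num_params = 11_000_000                                  # a ResNet-18-sized model (assumption)
weight_payload = num_params * BYTES_PER_FLOAT32          # ~44 MB per round

# Prediction exchange: payload grows with reference-set size, not model size.
num_reference_samples = 1_000                            # shared unlabeled samples (assumption)
num_classes = 10                                         # e.g., CIFAR-10-style task (assumption)
pred_payload = num_reference_samples * num_classes * BYTES_PER_FLOAT32  # ~40 KB per round

print(f"weights:     {weight_payload / 1e6:.1f} MB per round")
print(f"predictions: {pred_payload / 1e3:.1f} KB per round")
```

Under these assumptions the prediction payload is roughly a thousand times smaller, and two devices with different architectures can still exchange it, since both produce class probabilities over the same reference samples.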