Review for NeurIPS paper: Distributed Distillation for On-Device Learning


Additional Feedback:
- The empirical results are not very convincing: the performance of distributed distillation is significantly worse than that of plain distributed SGD. The amount of communication required is substantially smaller, but comparable communication savings have been achieved by federated averaging with C = 1 [3] or by dynamic averaging [4], with (seemingly) far better model performance (albeit on a fully connected network graph). I suggest comparing against those baselines on a fully connected network, and against decentralized learning approaches [5,6] on a non-fully-connected network. The authors might argue that distributed distillation has an advantage over federated averaging for non-convex problems: in federated averaging, averaging two models that lie in different local minima can yield a model that is far worse than either of the two local models (a toy sketch of this effect is given below).
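
To make the non-convexity concern concrete, here is a minimal, purely illustrative sketch (not taken from the paper or from [3]; the function names and the toy loss are my own) of a FedAvg-style aggregation step with C = 1, where averaging two client models that sit in different minima produces a worse model than either client model:

```python
import numpy as np

def fedavg_round(client_weights, client_sizes):
    """One FedAvg-style aggregation step: replace the server model with the
    data-size-weighted average of all client models (C = 1, i.e. every
    client participates in the round). Hypothetical helper for illustration."""
    total = sum(client_sizes)
    return sum((n / total) * w for w, n in zip(client_weights, client_sizes))

# Toy non-convex loss with two minima at w = +2 and w = -2 (assumed example).
loss = lambda w: np.minimum((w - 2.0) ** 2, (w + 2.0) ** 2) + 1.0

w_a, w_b = np.array([2.0]), np.array([-2.0])      # each client is at a minimum
w_avg = fedavg_round([w_a, w_b], client_sizes=[1, 1])

print(loss(w_a), loss(w_b))   # both local models: loss 1.0
print(loss(w_avg))            # averaged model lands between the minima: loss 5.0
```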