Reviews: PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
–Neural Information Processing Systems
Update: I have carefully read the authors' rebuttal. I have raised by score to 6 from 5 to reflect their clarification about Figure 3 and Table 6. It still seems that the speedups of the current formulation are often not of great practical significance, except for the language model which was able to give 2x wall clock speedup. As another reviewer noted, it is disappointing that the overall training time is not reported in the main paper, instead of the average batch time, as that makes it unclear whether latency times and other overheads between batches might be a significant concern. The author rebuttal notes that Appendix C shows time-to-accuracy, which would be good to mention in the main paper. But those results still appear mixed: for CIFAR10 SGD beats Rank1 and seems only competitive with Ranks 2,4, whereas for Language model all Ranks seem to convincingly beat SGD.
Neural Information Processing Systems
Jan-27-2025, 11:28:03 GMT
- Technology: