Reviews: Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
Neural Information Processing Systems
After rebuttal: I have carefully read the authors' response. Unfortunately, I do not think my concerns are well addressed. See Table 2 in "Regularizing and Optimizing LSTM Language Models" for comparison; (4) the performance of SGD on a single GTX1080 GPU does not indicate how it performs with multiple workers (a larger mini-batch size); (5) selecting the learning rate based on the test error is not good practice. In machine learning, hyper-parameters should be selected according to accuracy on a held-out validation set. Considering the above five points, I have decided to keep my score unchanged.
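To make point (5) concrete, the following is a minimal sketch of choosing a learning rate on a held-out validation split and touching the test split only once at the end. It is an illustration under stated assumptions, not the authors' setup: the toy one-parameter regression, the candidate learning rates, and all function names (`make_data`, `train_sgd`, `select_learning_rate`) are hypothetical.

```python
import random


def make_data(n, seed):
    """Toy data (assumption): y = 3x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]


def train_sgd(data, lr, epochs=20):
    """Plain SGD on squared error for the one-parameter model y = w * x."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w


def mse(w, data):
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def select_learning_rate(train, val, candidates):
    """Pick the candidate with the lowest validation error; test data is never used here."""
    scores = {lr: mse(train_sgd(train, lr), val) for lr in candidates}
    best = min(scores, key=scores.get)
    return best, scores


data = make_data(300, seed=0)
# Three disjoint splits: train to fit, validation to choose, test to report.
train, val, test = data[:200], data[200:250], data[250:]

best_lr, val_scores = select_learning_rate(train, val, [1e-4, 1e-2, 1e-1])

# The test split is evaluated once, after the learning rate has been fixed.
test_mse = mse(train_sgd(train, best_lr), test)
print(f"selected lr={best_lr}, test MSE={test_mse:.4f}")
```

Because the test split plays no role in choosing `best_lr`, the reported test error remains an unbiased estimate of generalization, which is the property lost when the learning rate is tuned on test error directly.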
Feb-5-2025, 22:54:10 GMT