47a658229eb2368a99f1d032c8848542-Supplemental.pdf
–Neural Information Processing Systems
Based on the feedback from the reviewers, we perform the following additional experiments which 0 explore the robustness of the choice of buffer size in SGD RER, choice of step sizes for GLMtron 10 and the behavior of the said algorithms with heavy tailed noise with a similar setup as in Section 7. We first perform an experimental study about the robustness of SGD RER to the choice of buffer size in Figure 3a. Notice that the performance remains the same for a large range of buffer sizes ( 100 from to 2000). However the performance degrades when the buffer size is too large ( 10000). We believe this is the case since the number of buffers decreases as the buffer size increases and the output is averaged over too few number of iterates (In the case of B = 10000, the final output is just an average of 10 iterates). Theoretically, this largest step-size is L where Lis the largest eigenvalue of -1 the Hessian. In the case of GLMtron, it was experimentally observed that if the step size was chosen 10 to be about 1.5 times the step size reported in Section 7, the iterates diverged. Quasi Newton method essentially normalizes the gradient with the inverse of the Hessian (or rather an approximation of the Hessian) in order to let it converge faster with large step sizes. In Figure 4, we consider the same system as in Section 7 but with heavy tailed noise given by the student t distribution (scale ν = 4.1) so that the 4-th moment exists but higher moments do not. The typical behavior of Forward SGD, SGD-ER, SGD-RER and Quasi Newton methods seems to be similar to that observed in the Sub-Gaussian noise case. However, GLMtron requires much smaller step sizes to ensure convergence and hence it takes much longer.
Neural Information Processing Systems
Apr-25-2026, 17:13:11 GMT