Appendix

Neural Information Processing Systems

In particular, SQuARM-SGD [45] can be viewed as CHOCO-SGD with momentum, but its theoretical convergence rate is slower than that of the original CHOCO-SGD. We provide some examples of compression operators satisfying Definition 1 that are used in our experiments.

(Line 6): the penultimate line follows from W1 = 1, and the last line follows from the induction hypothesis at the t-th iteration. (Line 3): in the second line we use the property of the mixing matrix, 1^T W = 1^T, and in the third line we apply Young's inequality (cf. (9)).

Bounding Ω_t^2 in (14b): similar to the derivation of (14a), by applying the update rule of G_t in BEER (Line 8), the definition of compression operators (Definition 1), and Young's inequality, we obtain the analogous bound. It then boils down to establishing (26).
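One standard example of a compression operator satisfying a contractive property of the kind in Definition 1 is top-k sparsification, which keeps only the k largest-magnitude coordinates and satisfies ||C(x) - x||^2 <= (1 - k/d) ||x||^2. A minimal sketch (illustrative; the exact operators used in the paper's experiments may differ):

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest.

    Deterministically satisfies the contraction bound
        ||C(x) - x||^2 <= (1 - k/d) ||x||^2,
    a common instance of a contractive compression operator.
    """
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

# Quick check of the contraction bound on a small vector (d = 4, k = 2).
x = np.array([3.0, -1.0, 0.5, 4.0])
c = top_k(x, 2)                       # keeps the entries 4.0 and 3.0
err = np.sum((c - x) ** 2)            # compression error = 1.0 + 0.25 = 1.25
bound = (1 - 2 / 4) * np.sum(x ** 2)  # (1 - k/d) ||x||^2 = 13.125
assert err <= bound
```

Randomized operators (e.g. random sparsification or stochastic quantization) satisfy the same bound in expectation rather than deterministically.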







Lemma 5. Let S = (Z_1, ..., Z_n) be a collection of n independent random variables, and let Φ be an arbitrary random variable defined on the same probability space. Furthermore, each of these summands has zero mean.

Given a deterministic algorithm f, we consider the algorithm that adds Gaussian noise to the predictions of f:

    f_σ(z, x, R) = f(z, x) + ξ,    (151)

where ξ ~ N(0, σ²I_d).

The figure in the middle repeats the experiment of Figure 1a while making the training algorithm stochastic by randomizing the seed.

Table 1: The architecture of the 4-layer convolutional neural network used in the MNIST 4 vs. 9 classification tasks.
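The noise-addition rule f_σ(z, x, R) = f(z, x) + ξ with ξ ~ N(0, σ²I_d) can be sketched in a few lines. The predictor and parameters below are hypothetical placeholders, not the paper's model; the random generator plays the role of the randomness R:

```python
import numpy as np

def f_sigma(f, z, x, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Randomize a deterministic predictor by adding isotropic Gaussian noise,
    mirroring (151): f_sigma(z, x, R) = f(z, x) + xi, with xi ~ N(0, sigma^2 I_d).
    """
    pred = np.asarray(f(z, x), dtype=float)
    xi = rng.normal(loc=0.0, scale=sigma, size=pred.shape)  # xi ~ N(0, sigma^2 I_d)
    return pred + xi

# Toy usage with a placeholder linear predictor (illustrative only).
rng = np.random.default_rng(0)
w = np.ones(4)                 # stands in for trained parameters z
predict = lambda z, x: z @ x   # deterministic f(z, x)
x = np.array([0.1, 0.2, 0.3, 0.4])
noisy_prediction = f_sigma(predict, w, x, sigma=0.5, rng=rng)
```

Setting sigma = 0 recovers the deterministic algorithm f exactly.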