vt 2
Appendix
In particular,SQuARM-SGD[45]can be viewed asCHOCO-SGD with momentum, but its theoretical convergence rate is slower than the originalCHOCO-SGD. We provide some examples of compression operators satisfying Definition 1 that are used in our experiments. Line 6), the penultimate line follows from W1 = 1, and the last line follows from the induction hypothesis at thet-th iteration. Line 3), in the second line we use the propertyofthemixingmatrix 1 W =1,andinthethirdline,weapplyYoung'sinequality(cf.(9)). Bounding Ωt2 in (14b) Similar to the derivation of (14a), by applying the update rule ofGt in BEER(Line 8),thedefinition ofcompression operators (Definition 1),andYoung'sinequality,we have It then boils down to establish (26).