Goto

Collaborating Authors



Communication-efficient Distributed SGD with Sketching

Neural Information Processing Systems

However, theoretical and empirical evidence both suggest that there is a maximum mini-batch size beyond which the number of iterations required to converge stops decreasing, and generalization error begins to increase [Ma et al., 2017, Li et al., 2014, Golmant et al., 2018, Shallue et al., 2018, Keskar et al., 2016, Hoffer et al., 2017]. In this paper, we aim instead to decrease the communication cost per worker.
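To make the sketching idea concrete, here is a minimal, illustrative Count Sketch compressor for gradients. This is a sketch of the general technique rather than the paper's exact algorithm; the function names, table sizes, and the median-based decoder are assumptions for illustration. Because the sketch is linear, a server can sum the workers' tables and decode once, so each worker transmits a small table instead of the full gradient.

```python
import numpy as np

def count_sketch(grad, num_rows=5, num_cols=256, seed=0):
    # Hash every coordinate into each row of a small table with a random
    # sign; workers transmit the table instead of the full gradient.
    # Sharing the seed means all workers use the same hash functions.
    rng = np.random.default_rng(seed)
    d = grad.size
    cols = rng.integers(0, num_cols, size=(num_rows, d))
    signs = rng.choice([-1.0, 1.0], size=(num_rows, d))
    table = np.zeros((num_rows, num_cols))
    for r in range(num_rows):
        np.add.at(table[r], cols[r], signs[r] * grad)
    return table, cols, signs

def decode(table, cols, signs):
    # Median over rows gives a robust estimate of each coordinate.
    rows = np.stack([signs[r] * table[r, cols[r]]
                     for r in range(table.shape[0])])
    return np.median(rows, axis=0)

# Sketches are linear: the server sums workers' tables and decodes once.
g1, g2 = np.random.randn(10_000), np.random.randn(10_000)
t1, cols, signs = count_sketch(g1)
t2, _, _ = count_sketch(g2)
approx_sum = decode(t1 + t2, cols, signs)  # approximates g1 + g2
```

Here each worker ships a 5 x 256 table in place of a 10,000-dimensional gradient; the estimate is most accurate on the heavy coordinates, which is what sketch-based compression schemes exploit.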






63dc7ed1010d3c3b8269faf0ba7491d4-Supplemental.pdf

Neural Information Processing Systems

In this document, we provide details and supplementary materials that cannot fit into the main manuscript due to the page limit. The specific form of the center distribution is unknown, but we can still train a generator G to approximate it. If R(G, D, T) satisfies the restriction, we choose λ = 0, i.e., no restriction on R(G, D, T), to obtain the minimal cost. If R(G, D, T) exceeds it, then a large λ should be applied as a penalization. According to the derivation of Eq. (3), we obtain a relaxed version of the intractable Eq. (2). In knowledge distillation, student models are crafted using unlabeled datasets, where only the soft targets from teachers are utilized.
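For the distillation step described above, the standard soft-target loss (in the style of Hinton et al.) can be written in a few lines of PyTorch. This shows only the generic soft-target term trained on unlabeled inputs, not the paper's full objective with the λ-weighted restriction on R(G, D, T):

```python
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    # KL divergence between temperature-softened teacher and student
    # distributions, scaled by T^2 so gradient magnitudes are preserved
    # across temperatures; no ground-truth labels are required.
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```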


439d8c975f26e5005dcdbf41b0d84161-Paper.pdf

Neural Information Processing Systems

We further give "active local" versions of these heuristics: given a test point x*, we show how the label T(x*) can be obtained. With this information, we may decide that h would not have been of much utility anyway, thereby saving ourselves the resources and effort of labeling the entire dataset S (and of running A).
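As a hedged illustration of this decision logic (the function names here are hypothetical, not the paper's API): query the single label T(x*), compare it with what h predicts, and only pay for labeling all of S and running A if the comparison suggests doing so would be useful.

```python
def worth_running(h, x_star, query_label):
    # Hypothetical "active local" check. query_label fetches the one
    # label T(x_star); if the hypothesis h already disagrees with it,
    # labeling the entire dataset S and running A may not be worthwhile.
    t_star = query_label(x_star)   # a single labeled point, cheap
    return h(x_star) == t_star     # False -> skip labeling S / running A
```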


Robust and differentially private mean estimation

Neural Information Processing Systems

Each participating individual should be able to contribute without the fear of leaking one's sensitive information. At the same time, the system should be robust in the presence of malicious participants inserting corrupted data. Recent algorithmic advances in learning from shared data focus on either one of these threats, leaving the system vulnerable to the other.
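A minimal baseline that touches both threats at once is clipping plus the Gaussian mechanism; the sketch below is a textbook construction for illustration, not the paper's estimator, and all parameter names are assumptions:

```python
import numpy as np

def private_robust_mean(x, clip_radius=1.0, epsilon=1.0, delta=1e-5, seed=0):
    # Clip each sample to an L2 ball: this bounds both the privacy
    # sensitivity of the mean and the influence any corrupted point
    # can exert on it.
    rng = np.random.default_rng(seed)
    n, d = x.shape
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_clipped = x * np.minimum(1.0, clip_radius / np.maximum(norms, 1e-12))
    # Gaussian mechanism: replacing one sample moves the clipped mean
    # by at most 2 * clip_radius / n in L2 norm.
    sensitivity = 2.0 * clip_radius / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return x_clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

Clipping limits how far a single (possibly adversarial) point can drag the estimate, while the calibrated noise yields (epsilon, delta)-differential privacy for the released mean.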


Appendix

Neural Information Processing Systems

In this section, we provide further intuition about the proposed AdaQN method. In the next stage, with 4m_0 samples, we use the original Hessian inverse approximation 2R_{m_0}(w_{m_0})^{-1} and the new variable w_{2m_0} for the BFGS updates. As V_n = O(1/n) (since n ≥ m_0 = Ω(κ² log d)) and n = 2m, condition (38) is equivalent to bounding 1/t_n by 1/6.6. This parameter depends heavily on the variation/variance of the input features for linear models. Thus, we can focus on the diagonal components of these two matrices only.
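For reference, the generic BFGS inverse-Hessian update that the stagewise scheme above plugs into. The stage-specific choices (the 4m_0 samples, the initialization 2R_{m_0}(w_{m_0})^{-1}, the iterate w_{2m_0}) come from the text; this function is only the standard update formula:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    # Standard BFGS update of the inverse Hessian approximation H,
    # given displacement s = w_new - w_old and gradient difference
    # y = grad_new - grad_old (requires y @ s > 0 so that positive
    # definiteness of H is preserved).
    rho = 1.0 / float(y @ s)
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```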