The proposed LP filter is fundamentally different from previous weighted

Neural Information Processing Systems

Due to space constraints we only address major concerns; all suggestions will be included in the final version. Experimentally, we have observed that when using previous weighted … We will compare and cite related work (gTop-k) in the final draft. In Sec. 3 we assume mini-batch SGD has a small critical batch size, so that it approximates a full gradient descent iteration regardless of dataset size. Appendix F shows ScaleCom's scalability in system performance; more … Analogously, we perform filtering on the residual gradients (see Eq. (5)); the connection will be discussed in the revised version.
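To make the filtering idea concrete, a generic low-pass (exponential moving average) filter applied to a worker's residual gradient accumulation can be written as below. This is only an illustrative sketch: the smoothing coefficient beta and this exact update rule are assumptions, not the paper's Eq. (5).

```latex
% Illustrative low-pass (EMA) filter on worker i's residual accumulation.
% g_t^{(i)}: local stochastic gradient at step t; m_t^{(i)}: filtered residual;
% beta in [0,1) is a hypothetical smoothing coefficient.
\[
  m_t^{(i)} \;=\; \beta\, m_{t-1}^{(i)} + (1-\beta)\, g_t^{(i)}
\]
```

A larger beta damps step-to-step noise more strongly; the abstract attributes the improved large-batch behavior of ScaleCom to this kind of smoothing of the local accumulation.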


ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training

Neural Information Processing Systems

Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing compression methods do not scale well to large-scale distributed systems (due to gradient build-up) and/or lack evaluation on large datasets. To mitigate these issues, we propose a new compression technique, Scalable Sparsified Gradient Compression (ScaleCom), that (i) leverages similarity in the gradient distribution amongst learners to provide a commutative compressor and keep communication cost constant with respect to the number of workers, and (ii) includes a low-pass filter in local gradient accumulation to mitigate the impact of large-batch training and significantly improve scalability. Using theoretical analysis, we show that ScaleCom provides favorable convergence guarantees and is compatible with gradient all-reduce techniques. Furthermore, we experimentally demonstrate that ScaleCom has small overheads, directly reduces gradient traffic, and provides high compression rates (70-150X) and excellent scalability (up to 64-80 learners and 10X larger batch sizes over normal training) across a wide range of applications (image, language, and speech) without significant accuracy loss.
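As a rough illustration of the two ingredients named above (a top-k sparsifier with error feedback, plus a low-pass filter on the local accumulation), the following Python sketch shows one local compression step. It is a minimal sketch under assumed names (`top_k_mask`, `compress_step`, `beta`) and does not implement ScaleCom's commutative, cyclically coordinated selection across workers.

```python
# Minimal sketch (not ScaleCom's exact algorithm) of error-feedback top-k
# sparsification with a damped (low-pass) local residual accumulation.
import numpy as np

def top_k_mask(x, k):
    """Return a boolean mask selecting the k largest-magnitude entries of x."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    mask = np.zeros_like(x, dtype=bool)
    mask[idx] = True
    return mask

def compress_step(grad, residual, k, beta=0.9):
    """One local compression step for a single worker.

    grad     : current stochastic gradient of this worker
    residual : locally accumulated error carried over from previous steps
    Returns the sparse update to communicate and the new residual.
    """
    # Decay the carried-over residual (a simple low-pass / leaky accumulation).
    acc = beta * residual + grad
    # Keep only the k largest-magnitude entries; the rest stay in the residual.
    mask = top_k_mask(acc, k)
    sparse_update = np.where(mask, acc, 0.0)
    new_residual = np.where(mask, 0.0, acc)
    return sparse_update, new_residual
```

In an actual distributed run, each worker would communicate only the indices and values of `sparse_update`; ScaleCom additionally coordinates which indices are selected so that the compressor commutes with all-reduce, which this sketch omits.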


A Observations in Local Memory Similarity

Neural Information Processing Systems

We observed the local memory's similarity through Q-Q (quantile-quantile) plots, as shown in Figure A1. In Figure A1(a), the linearity of the points in the Q-Q plot suggests that worker 1's local … This is consistent with our observations on pairwise cosine distance shown in Figure 2(a), and indicates that we can possibly use a local worker's top-k …

One variant of Young's inequality is $\|x + y\|^2 \le (1+\epsilon)\|x\|^2 + (1+\epsilon^{-1})\|y\|^2$ for any $\epsilon > 0$ (A.1), and the quadrilateral identity is $\langle x, y \rangle = \tfrac{1}{2}\left(\|x\|^2 + \|y\|^2 - \|x - y\|^2\right)$; $f^*$ denotes the global minimum of $f(x)$.

We provide the following summary to explain Section 3's main results and connect them to other parts of the paper. Our Theorem 1 shows that ScaleCom's convergence rate matches SGD, which indicates its applicability in distributed training.

Lemma 1 (contraction property): the intuition is that higher correlation between workers brings CLT-k closer to the true top-k; Fig. 2 and 3 show high correlation between workers, so our contraction is close to the true top-k.
Lemma 2 (contraction in the distributed setting): requires positive correlation between workers; Fig. 2 and 3 show positive correlation between workers.
Theorem 1 (ScaleCom's convergence rate is the same as SGD, $O(1/\sqrt{T})$): Tables 1 and 2 (Fig. 4 and 5) verify that ScaleCom's convergence matches the baseline.

Each node is equipped with two IBM POWER9 processors clocked at 3.15 GHz.
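The similarity observations above can be quantified with pairwise cosine similarity and quantile-quantile comparisons. The snippet below is an assumed illustration (synthetic data, hypothetical helper names), not the paper's measurement scripts.

```python
# Illustrative check of similarity between two workers' local memories:
# pairwise cosine similarity and matched quantiles (the basis of a Q-Q plot).
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened local-memory vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def qq_points(a, b, num_quantiles=100):
    """Matched quantiles of a and b; points near the diagonal indicate
    similarly shaped distributions (what a Q-Q plot visualizes)."""
    q = np.linspace(0.0, 1.0, num_quantiles)
    return np.quantile(a, q), np.quantile(b, q)

# Synthetic residuals standing in for two (correlated) workers' local memories.
rng = np.random.default_rng(0)
mem_w1 = rng.normal(scale=1e-3, size=10_000)
mem_w2 = mem_w1 + rng.normal(scale=2e-4, size=10_000)
print("cosine similarity:", cosine_similarity(mem_w1, mem_w2))
qa, qb = qq_points(mem_w1, mem_w2)
```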


ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training Chia-Yu Chen

Neural Information Processing Systems

Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained. To overcome this limitation, numerous gradient compression techniques have been proposed and have demonstrated high compression ratios. However, most existing methods do not scale well to large-scale distributed systems (due to gradient build-up) and/or fail to evaluate model fidelity (test accuracy) on large datasets.

