A Appendix

A.1 Stochastic Rounding
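For orientation, the general idea of stochastic rounding can be sketched as follows: a value is rounded up or down to a neighbouring representable value with probability proportional to its proximity, so that the rounding is unbiased in expectation. This is a minimal illustration only. The uniform grid with spacing `step`, the function name `stochastic_round`, and the use of NumPy are assumptions made for this sketch; it does not reproduce the mantissa-level realization discussed in this appendix.

```python
import numpy as np


def stochastic_round(x, step=1.0, rng=None):
    """Round x to a multiple of `step`, up or down at random.

    Hypothetical helper for illustration: the uniform grid is an
    assumption of this sketch, not the floating-point format of the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    scaled = x / step
    lower = np.floor(scaled)
    # Fractional distance to the lower grid point, in [0, 1).
    frac = scaled - lower
    # Round up with probability `frac`, so that E[result] equals x.
    round_up = rng.random(scaled.shape) < frac
    return (lower + round_up) * step


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = stochastic_round(np.full(100_000, 0.3), step=1.0, rng=rng)
    # Every sample is 0.0 or 1.0; the sample mean should be close to 0.3.
    print(samples.mean())
```

Averaging many independently rounded copies of the same value recovers it approximately, whereas deterministic round-to-nearest would map 0.3 to 0 every time.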

Neural Information Processing Systems 

A realization of the stochastic rounding is shown in Figure 4. Here, a 24-bit single floating-point mantissa

A.2 Representation mapping increases the gradient variance: Linear layer example

A linear layer is essentially a matrix multiplication. Inequality (18) supports our Assumption 2 (iii,b) i.e.

The proof follows that of Bottou et al.

The experiments in this paper were run using the following numbers of GPUs. ResNet18 on CIFAR10 runs on 1 V100 GPU when the batch size is 128.