Rethinking gradient sparsification as total error minimization