Does compressing activations help model parallel training?