TAGC: Optimizing Gradient Communication in Distributed Transformer Training