Towards Low-bit Communication for Tensor Parallel LLM Inference
