Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
