Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference