On Evaluating the Performance of LLM Inference Serving Systems