Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance