Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores