Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
