Quantifying Variance in Evaluation Benchmarks