An Empirical Analysis of Uncertainty in Large Language Model Evaluations