Re-evaluating Open-ended Evaluation of Large Language Models