AutoEval Done Right: Using Synthetic Data for Model Evaluation