Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data