Language Model Preference Evaluation with Multiple Weak Evaluators