Evaluating Large Language Models at Evaluating Instruction Following