Evaluating Large Language Models at Evaluating Instruction Following
