Rater Equivalence: Evaluating Classifiers in Human Judgment Settings