Evaluating multiple models using labeled and unlabeled data