Evaluating AI systems under uncertain ground truth: a case study in dermatology