On Evaluating Artificial Intelligence Systems for Medical Diagnosis