The curious case of the test set AUROC

Roberts, Michael, Hazan, Alon, Dittmer, Sören, Rudd, James H. F., Schönlieb, Carola-Bibiane

arXiv.org Artificial Intelligence 

Whilst the size and complexity of ML models have rapidly and significantly increased over the past decade, the methods for assessing their performance have not kept pace. In particular, among the many potential performance metrics, the ML community stubbornly continues to use (a) the area under the receiver operating characteristic curve (AUROC) for a validation and test cohort (distinct from training data) or (b) the sensitivity and specificity for the test data at an optimal threshold determined from the validation ROC. The use of a validation and test cohort of data is a staple for ML researchers, and the key strength of the ROC curve is that, by considering different thresholds, we gain great insight into a model's behaviour. We don't seek to discuss the individual shortcomings of the AUROC (e.g. the extrapolation required for 'degenerate' distributions); rather, we note that it is possible to obtain consistently good AUROC values even when a model is evaluated on datasets from different distributions.

Figure: Example validation and test set model output distributions; ROC curves coloured by threshold.
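
The evaluation practice criticised above can be made concrete with a minimal sketch (not taken from the paper): compute the AUROC on a validation and a test cohort, pick an "optimal" threshold from the validation ROC (here Youden's J statistic, an assumed choice, since the paper does not fix the criterion), and report test-set sensitivity and specificity at that threshold. The function name and use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix

def evaluate_cohorts(y_val, p_val, y_test, p_test):
    """y_* are binary labels (0/1); p_* are model output scores."""
    p_val, p_test = np.asarray(p_val), np.asarray(p_test)

    # (a) AUROC reported separately for the validation and test cohorts.
    val_auroc = roc_auc_score(y_val, p_val)
    test_auroc = roc_auc_score(y_test, p_test)

    # Threshold chosen on the validation ROC via Youden's J = TPR - FPR (assumed criterion).
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    threshold = thresholds[np.argmax(tpr - fpr)]

    # (b) Sensitivity and specificity on the test cohort at the validation-derived threshold.
    tn, fp, fn, tp = confusion_matrix(y_test, (p_test >= threshold).astype(int)).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return val_auroc, test_auroc, threshold, sensitivity, specificity
```

Note that nothing in this procedure inspects the model output distributions themselves, which is why two cohorts can yield similarly good AUROC values while their score distributions differ substantially.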