Overcoming Common Flaws in the Evaluation of Selective Classification Systems
Neural Information Processing Systems
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning-based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems, akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
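To make the "average risk of undetected failures" reading concrete, here is a minimal sketch of how such an area could be computed from per-sample confidence scores and failure labels. The abstract does not state the formula, so the sketch assumes that the generalized risk at a given coverage is the fraction of all samples that are both accepted and misclassified, and that the curve starts at zero risk for zero coverage. The function name `augrc` and its arguments are hypothetical, and tied confidence scores are not treated specially.

```python
from typing import Sequence


def augrc(confidence: Sequence[float], is_failure: Sequence[bool]) -> float:
    """Area under an (assumed) generalized risk coverage curve.

    Assumption: generalized risk at coverage k/n is the number of
    misclassified samples among the k most confident ones, divided by
    the total sample count n (not by k, as in selective risk).
    """
    n = len(confidence)
    if n == 0 or n != len(is_failure):
        raise ValueError("inputs must be non-empty and of equal length")

    # Accept samples from most to least confident.
    order = sorted(range(n), key=lambda i: confidence[i], reverse=True)

    area = 0.0
    prev_risk = 0.0  # assumed curve start: zero risk at zero coverage
    failures = 0
    for i in order:
        failures += int(bool(is_failure[i]))
        risk = failures / n  # accepted failures over ALL samples
        # Trapezoidal rule; each accepted sample adds 1/n of coverage.
        area += 0.5 * (prev_risk + risk) / n
        prev_risk = risk
    return area


if __name__ == "__main__":
    # Failures ranked last (best case for a 50% accurate classifier).
    print(augrc([0.9, 0.8, 0.7, 0.6], [False, False, True, True]))  # 0.125
    # Failures ranked first (worst case for the same classifier).
    print(augrc([0.9, 0.8, 0.7, 0.6], [True, True, False, False]))  # 0.375
```

Under these assumptions, lower values are better: the score falls when failures receive low confidence and when the classifier makes fewer errors overall, so it reflects both ranking quality and accuracy.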
Dec-27-2025, 22:41:44 GMT
- Country:
- Europe
- Germany > Baden-Württemberg
- Karlsruhe Region > Heidelberg (0.04)
- Spain > Andalusia
- Granada Province > Granada (0.04)
- North America > United States (0.04)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Industry:
- Health & Medicine
- Diagnostic Medicine (0.93)
- Therapeutic Area > Oncology (1.00)
- Technology: