AITopics | augrc

Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Neural Information Processing SystemsMar-17-2026, 20:28:49 GMT

Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.81)

Add feedback

OvercomingCommonFlawsintheEvaluationof SelectiveClassificationSystems

Neural Information Processing SystemsFeb-7-2026, 07:09:07 GMT

Whilecurrentevaluationofthese systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them.

artificial intelligence, justification, machine learning, (20 more...)

Neural Information Processing Systems

Country:

Europe > Germany (0.05)
North America > United States (0.04)
Europe > Spain > Andalusia > Granada Province > Granada (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Health & Medicine > Diagnostic Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

047c84ec50bd8ea29349b996fc64af4b-Paper-Conference.pdf

Neural Information Processing SystemsOct-9-2025, 17:26:21 GMT

augrc, dataset, prediction, (12 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
North America > United States (0.04)
Europe > Spain > Andalusia > Granada Province > Granada (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
(2 more...)

Add feedback

Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Neural Information Processing SystemsMay-26-2025, 15:07:20 GMT

Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the \mathrm{AUROC} in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ( \mathrm{AUGRC}), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of \mathrm{AUGRC} on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions.

artificial intelligence, machine learning, selective classification system, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Traub, Jeremias, Bungert, Till J., Lüth, Carsten T., Baumgartner, Michael, Maier-Hein, Klaus H., Maier-Hein, Lena, Jaeger, Paul F

arXiv.org Artificial IntelligenceJul-1-2024

Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.

augrc, aurc, prediction, (14 more...)

arXiv.org Artificial Intelligence

2407.01032

Country: