augrc
OvercomingCommonFlawsintheEvaluationof SelectiveClassificationSystems
Whilecurrentevaluationofthese systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them.
- Europe > Germany (0.05)
- North America > United States (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Health & Medicine > Therapeutic Area (0.46)
- Health & Medicine > Diagnostic Medicine (0.46)
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- North America > United States (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (0.93)
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the \mathrm{AUROC} in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ( \mathrm{AUGRC}), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of \mathrm{AUGRC} on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions.
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
Traub, Jeremias, Bungert, Till J., Lüth, Carsten T., Baumgartner, Michael, Maier-Hein, Klaus H., Maier-Hein, Lena, Jaeger, Paul F
Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- North America > United States (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (0.93)