miscalibration
Probability calibration for precipitation nowcasting
Kurki, Lauri, Cabrera, Yaniel, Karanko, Samu
Reliable precipitation nowcasting is critical for weather-sensitive decision-making, yet neural weather models (NWMs) can produce poorly calibrated probabilistic forecasts. Standard calibration metrics such as the expected calibration error (ECE) fail to capture miscalibration across precipitation thresholds. We introduce the expected thresholded calibration error (ETCE), a new metric that better captures miscalibration in ordered classes like precipitation amounts. We extend post-processing techniques from computer vision to the forecasting domain. Our results show that selective scaling with lead time conditioning reduces model miscalibration without reducing forecast quality.
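The abstract does not spell out the ETCE formula; as a minimal illustration, assuming the idea is to average a standard binned calibration error over the binary exceedance events P(precip > t) at each threshold t, a sketch might look like the following (the function names and binning are illustrative, not the paper's implementation):

```python
import numpy as np

def ece(probs, outcomes, n_bins=10):
    """Standard binned expected calibration error for a binary probabilistic forecast."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs <= hi) if hi == 1.0 else (probs >= lo) & (probs < hi)
        if mask.any():
            err += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(err)

def thresholded_calibration_error(exceed_probs, amounts, thresholds, n_bins=10):
    """Average ECE over binary exceedance events P(precip > t), one per threshold t.
    exceed_probs: (n_forecasts, n_thresholds) forecast exceedance probabilities,
    amounts: (n_forecasts,) observed precipitation amounts.
    An illustrative reading of 'thresholded' calibration, not the paper's exact ETCE."""
    errs = [ece(exceed_probs[:, i], (amounts > t).astype(float), n_bins)
            for i, t in enumerate(thresholds)]
    return float(np.mean(errs))
```

Scoring each exceedance threshold separately is what would let such a metric surface miscalibration that a single aggregate ECE over precipitation classes washes out.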
- Europe > Finland (0.05)
- North America > United States (0.04)
- Asia > China > Beijing > Beijing (0.04)
Gated Uncertainty-Aware Runtime Dual Invariants for Neural Signal-Controlled Robotics
Kim, Tasha, Jones, Oiwi Parker
Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification of neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system maintains a high safety rate of 94-97% even with lightweight decoder architectures that have low test accuracies (27-46%) and high ECE confidence miscalibration (0.22-0.41). We demonstrate 1.7x correct interventions in simulated noise testing versus baseline. The monitor operates at 100 Hz with sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation and produces auditable traces from intent to plan to action, helping to link neural evidence to verifiable robot action.
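As an illustration of the kind of dual-layer runtime gate the abstract describes, here is a schematic sketch with hypothetical thresholds and inputs (not GUARDIAN's actual invariants or decision logic):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    action: Optional[str]   # None when the gate intervenes
    intervened: bool
    reason: str

def runtime_gate(decoded_action: str,
                 decoder_confidence: float,
                 signal_quality: float,
                 conf_threshold: float = 0.6,
                 quality_threshold: float = 0.5) -> Decision:
    """Schematic dual-check gate: forward the decoded action only if both the
    calibrated decoder confidence (logical layer) and a physiological
    signal-quality score (trust layer) clear their thresholds. The thresholds
    and inputs are hypothetical placeholders."""
    if signal_quality < quality_threshold:
        return Decision(None, True, "physiological trust invariant violated")
    if decoder_confidence < conf_threshold:
        return Decision(None, True, "decoder confidence below the safety gate")
    return Decision(decoded_action, False, "both invariants satisfied")

# Example: a low-quality EEG window triggers an intervention even when the decoder is confident.
print(runtime_gate("move_arm_left", decoder_confidence=0.9, signal_quality=0.3))
```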
- North America > United States (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- (2 more...)
- Health & Medicine > Health Care Technology (0.66)
- Health & Medicine > Therapeutic Area > Neurology (0.48)
- Research Report > Strength High (1.00)
- Research Report > Experimental Study (1.00)
A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Jeon, Sohyeon, Lee, Hyung-Chul
Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs -- one general and one domain-specialized -- across three prompt strategies. We analyze both cognitive adaptation and calibration error using two metrics: the Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE) that enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering for the development of reliable and explainable medical AI.
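The RCE formula is not given in the abstract; a minimal sketch, assuming RCE simply normalizes the model's ECE by the ECE of a baseline predictor, could look like this (the normalization choice is an assumption, not the paper's definition):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned gap between the model's stated confidence and its empirical accuracy."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if m.any():
            ece += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return float(ece)

def relative_calibration_error(conf, correct, baseline_ece, n_bins=10):
    """One plausible baseline normalization (assumed): the model's ECE divided by
    the ECE of a reference/baseline predictor, so values are comparable across models."""
    return expected_calibration_error(conf, correct, n_bins) / baseline_ece
```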
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Research Report > Strength High (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Diagnostic Medicine (0.93)
- Health & Medicine > Therapeutic Area > Neurology (0.64)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.40)
Enforcing Calibration in Multi-Output Probabilistic Regression with Pre-rank Regularization
Desobry, Naomi, Zhalieva, Elnura, Taieb, Souhaib Ben
Probabilistic models must be well calibrated to support reliable decision-making. While calibration in single-output regression is well studied, defining and achieving multivariate calibration in multi-output regression remains considerably more challenging. The existing literature on multivariate calibration primarily focuses on diagnostic tools based on pre-rank functions, which are projections that reduce multivariate prediction-observation pairs to univariate summaries to detect specific types of miscalibration. In this work, we go beyond diagnostics and introduce a general regularization framework to enforce multivariate calibration during training for arbitrary pre-rank functions. This framework encompasses existing approaches such as highest density region calibration and copula calibration. Our method enforces calibration by penalizing deviations of the projected probability integral transforms (PITs) from the uniform distribution, and can be added as a regularization term to the loss function of any probabilistic predictor. Specifically, we propose a regularization loss that jointly enforces both marginal and multivariate pre-rank calibration. We also introduce a new PCA-based pre-rank that captures calibration along directions of maximal variance in the predictive distribution, while also enabling dimensionality reduction. Across 18 real-world multi-output regression datasets, we show that unregularized models are consistently miscalibrated, and that our methods significantly improve calibration across all pre-rank functions without sacrificing predictive accuracy.
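A minimal sketch of the PIT-uniformity idea described above, assuming the pre-rank is a fixed linear projection (e.g., a PCA axis) and using a Cramér-von-Mises-style distance as the penalty; the paper's training-time regularizer would additionally need a differentiable relaxation of the rank step:

```python
import numpy as np

def pre_rank_pit(pred_samples, y_obs, direction):
    """PIT of one observation under a scalar pre-rank projection.
    pred_samples: (n_samples, d) draws from the predictive distribution,
    y_obs: (d,) observed vector, direction: (d,) projection (e.g. a PCA axis)."""
    proj_samples = pred_samples @ direction
    proj_obs = float(y_obs @ direction)
    return float((proj_samples <= proj_obs).mean())

def uniformity_penalty(pits):
    """Cramér-von-Mises-style distance between the empirical PIT distribution and
    Uniform(0,1); an illustrative stand-in for the regularization term added to the loss."""
    pits = np.sort(np.asarray(pits, float))
    n = len(pits)
    targets = (np.arange(1, n + 1) - 0.5) / n   # mid-point uniform quantiles
    return float(np.mean((pits - targets) ** 2))
```

A well-calibrated predictor yields projected PITs close to uniform, so the penalty is near zero; systematic over- or under-dispersion along the chosen direction pushes it up.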
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- North America > United States > Maryland > Prince George's County > Hyattsville (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.93)
- Research Report > New Finding (0.68)
Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models
Wibbeke, Jelke, Schönfisch, Nico, Rohjans, Sebastian, Rauh, Andreas
In safety-critical applications, data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression, a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
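For reference, a sketch of the ENCE, one of the two metrics the study identifies as most dependable, following its common variance-binned definition (the equal-count binning and bin count are illustrative choices):

```python
import numpy as np

def ence(pred_mean, pred_std, y_true, n_bins=10):
    """Expected Normalized Calibration Error for regression: sort by predicted std,
    then compare the root mean predicted variance (RMV) against the empirical RMSE
    within each bin, normalized by the RMV."""
    pred_mean, pred_std, y_true = map(np.asarray, (pred_mean, pred_std, y_true))
    order = np.argsort(pred_std)
    bins = [idx for idx in np.array_split(order, n_bins) if len(idx) > 0]
    err = 0.0
    for idx in bins:
        rmv = np.sqrt(np.mean(pred_std[idx] ** 2))     # predicted uncertainty in the bin
        rmse = np.sqrt(np.mean((y_true[idx] - pred_mean[idx]) ** 2))  # observed error
        err += abs(rmv - rmse) / rmv
    return float(err / len(bins))
```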
- North America > United States (0.14)
- Europe > Germany > Lower Saxony > Oldenburg (0.04)
- Europe > Switzerland (0.04)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
- Asia > Singapore (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)