miscalibration



Probability calibration for precipitation nowcasting

Kurki, Lauri, Cabrera, Yaniel, Karanko, Samu

arXiv.org Artificial Intelligence

Reliable precipitation nowcasting is critical for weather-sensitive decision-making, yet neural weather models (NWMs) can produce poorly calibrated probabilistic forecasts. Standard calibration metrics such as the expected calibration error (ECE) fail to capture miscalibration across precipitation thresholds. We introduce the expected thresholded calibration error (ETCE), a new metric that better captures miscalibration in ordered classes like precipitation amounts. We extend post-processing techniques from computer vision to the forecasting domain. Our results show that selective scaling with lead time conditioning reduces model miscalibration without reducing the forecast quality.
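The abstract does not spell out the ETCE formula, but a natural reading is that it averages a binary expected calibration error over an ordered set of exceedance thresholds. The following is a minimal sketch under that assumption (the function names and the exact form of `etce` are illustrative, not the paper's definition):

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Standard expected calibration error for binary probabilities:
    bin predictions by confidence and average |accuracy - confidence|,
    weighted by the fraction of samples in each bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include 1.0 in the last bin
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            total += mask.mean() * gap
    return total

def etce(exceed_probs, amounts, thresholds, n_bins=10):
    """Illustrative thresholded calibration error (assumed form): for each
    precipitation threshold t, score the predicted exceedance probability
    against the binary event (amount >= t), then average over thresholds."""
    return float(np.mean([
        ece(exceed_probs[t], amounts >= t, n_bins) for t in thresholds
    ]))
```

A single scalar ECE on the full distribution can look good while the model is badly calibrated at rare, high-precipitation thresholds; averaging per-threshold errors exposes exactly that failure mode.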


Gated Uncertainty-Aware Runtime Dual Invariants for Neural Signal-Controlled Robotics

Kim, Tasha, Jones, Oiwi Parker

arXiv.org Artificial Intelligence

Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification of neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system maintains a high safety rate of 94-97% even with lightweight decoder architectures that have low test accuracies (27-46%) and high confidence miscalibration (ECE 0.22-0.41). We demonstrate 1.7x as many correct interventions in simulated noise testing as at baseline. The monitor operates at 100 Hz with sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation, and produces auditable traces from intent to plan to action, helping to link neural evidence to verifiable robot action.
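The dual-layer idea described above can be caricatured in a few lines: a physiological-trust gate on calibrated decoder confidence, followed by a logical-safety check against a symbolic invariant. This is a toy sketch with invented names and thresholds, not the GUARDIAN implementation:

```python
from dataclasses import dataclass

@dataclass
class Decoded:
    intent: str        # decoded motor-imagery class
    confidence: float  # calibrated probability of the decoded class

def gated_monitor(decoded, safe_intents, tau=0.6):
    """Minimal two-layer runtime gate (illustrative, not GUARDIAN's API):
    layer 1 defers when calibrated confidence is below the trust
    threshold tau; layer 2 blocks intents that violate the symbolic
    safety invariant; otherwise the intent is passed on for execution."""
    if decoded.confidence < tau:
        return ("defer", "low confidence")   # physiological-trust layer
    if decoded.intent not in safe_intents:
        return ("block", "unsafe intent")    # logical-safety layer
    return ("execute", decoded.intent)
```

The point of the gate is exactly the regime the abstract reports: even a weak decoder can be used safely if miscalibrated, low-confidence decodes are deferred rather than acted on.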



A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist

Jeon, Sohyeon, Lee, Hyung-Chul

arXiv.org Artificial Intelligence

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs -- one general and one domain-specialized -- across three prompt strategies. We analyze both cognitive adaptation and calibration error using two metrics: the Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE), which enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering for the development of reliable and explainable medical AI.


Enforcing Calibration in Multi-Output Probabilistic Regression with Pre-rank Regularization

Desobry, Naomi, Zhalieva, Elnura, Taieb, Souhaib Ben

arXiv.org Machine Learning

Probabilistic models must be well calibrated to support reliable decision-making. While calibration in single-output regression is well studied, defining and achieving multivariate calibration in multi-output regression remains considerably more challenging. The existing literature on multivariate calibration primarily focuses on diagnostic tools based on pre-rank functions, which are projections that reduce multivariate prediction-observation pairs to univariate summaries to detect specific types of miscalibration. In this work, we go beyond diagnostics and introduce a general regularization framework to enforce multivariate calibration during training for arbitrary pre-rank functions. This framework encompasses existing approaches such as highest density region calibration and copula calibration. Our method enforces calibration by penalizing deviations of the projected probability integral transforms (PITs) from the uniform distribution, and can be added as a regularization term to the loss function of any probabilistic predictor. Specifically, we propose a regularization loss that jointly enforces both marginal and multivariate pre-rank calibration. We also introduce a new PCA-based pre-rank that captures calibration along directions of maximal variance in the predictive distribution, while also enabling dimensionality reduction. Across 18 real-world multi-output regression datasets, we show that unregularized models are consistently miscalibrated, and that our methods significantly improve calibration across all pre-rank functions without sacrificing predictive accuracy.
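The core mechanism, penalizing deviation of probability integral transforms from uniformity, can be illustrated in the univariate case; in the multi-output setting the same penalty would be applied to the pre-rank-projected PITs. This is a numpy sketch of the idea with an illustrative Cramér-von-Mises-style penalty, not the paper's loss or implementation (a training-time version would use a differentiable framework):

```python
import numpy as np
from math import erf, sqrt

def gaussian_pit(y, mu, sigma):
    """PIT values under a Gaussian predictive distribution:
    z_i = F_i(y_i) = Phi((y_i - mu_i) / sigma_i). If the model is
    calibrated, the z_i are uniformly distributed on [0, 1]."""
    z = (np.asarray(y, float) - np.asarray(mu, float)) / np.asarray(sigma, float)
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])

def pit_uniformity_penalty(pits):
    """Illustrative uniformity penalty: mean squared gap between the
    sorted PITs and the uniform plotting positions i / (n + 1).
    Zero when the empirical PIT distribution sits exactly on the
    uniform quantiles; grows as PITs clump or drift."""
    z = np.sort(np.asarray(pits, float))
    n = len(z)
    u = np.arange(1, n + 1) / (n + 1.0)
    return float(np.mean((z - u) ** 2))
```

Added as a regularization term, such a penalty pulls the predictive distributions toward calibration while the base loss (e.g. negative log-likelihood) preserves sharpness, which matches the paper's claim of improving calibration without sacrificing predictive accuracy.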



Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models

Wibbeke, Jelke, Schönfisch, Nico, Rohjans, Sebastian, Rauh, Andreas

arXiv.org Machine Learning

In safety-critical applications, data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression, a wide variety of calibration metrics and recalibration methods has emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
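Of the two metrics the paper singles out, ENCE has a compact standard form: sort samples by predicted uncertainty, split into equal-count bins, and average the normalized gap between the root mean predicted variance (RMV) and the empirical RMSE per bin. A minimal sketch of that standard definition (binning details vary across implementations):

```python
import numpy as np

def ence(pred_sigma, abs_errors, n_bins=5):
    """Expected Normalized Calibration Error: sort by predicted sigma,
    split into n_bins equal-count bins, and average |RMV - RMSE| / RMV,
    where RMV is the root mean predicted variance and RMSE the empirical
    root mean squared error within each bin. Zero means the predicted
    uncertainty matches the observed error magnitude in every bin."""
    order = np.argsort(pred_sigma)
    sig = np.asarray(pred_sigma, float)[order]
    err = np.asarray(abs_errors, float)[order]
    total = 0.0
    for chunk_s, chunk_e in zip(np.array_split(sig, n_bins),
                                np.array_split(err, n_bins)):
        rmv = np.sqrt(np.mean(chunk_s ** 2))
        rmse = np.sqrt(np.mean(chunk_e ** 2))
        total += abs(rmv - rmse) / rmv
    return total / n_bins
```

Because ENCE normalizes per bin by RMV, it is insensitive to the overall error scale, one plausible reason it behaves more consistently than unnormalized metrics across the paper's heterogeneous datasets.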