Calibration


Calibeating Prediction-Powered Inference

van der Laan, Lars, van der Laan, Mark

arXiv.org Machine Learning

We study semisupervised mean estimation with a small labeled sample, a large unlabeled sample, and a black-box prediction model whose output may be miscalibrated. A standard approach in this setting is augmented inverse-probability weighting (AIPW) [Robins et al., 1994], which protects against prediction-model misspecification but can be inefficient when the prediction score is poorly aligned with the outcome scale. We introduce Calibrated Prediction-Powered Inference, which post-hoc calibrates the prediction score on the labeled sample before using it for semisupervised estimation. This simple step requires no retraining and can improve the original score both as a predictor of the outcome and as a regression adjustment for semisupervised inference. We study both linear and isotonic calibration. For isotonic calibration, we establish first-order optimality guarantees: isotonic post-processing can improve predictive accuracy and estimator efficiency relative to the original score and simpler post-processing rules, while no further post-processing of the fitted isotonic score yields additional first-order gains. For linear calibration, we show first-order equivalence to PPI++. We also clarify the relationship among existing estimators, showing that the original PPI estimator is a special case of AIPW and can be inefficient when the prediction model is accurate, while PPI++ is AIPW with empirical efficiency maximization [Rubin et al., 2008]. In simulations and real-data experiments, our calibrated estimators often outperform PPI and are competitive with, or outperform, AIPW and PPI++. We provide an accompanying Python package, ppi_aipw, at https://larsvanderlaan.github.io/ppi-aipw/.
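
To make the calibration step concrete, the following is a minimal sketch, assuming a scikit-learn isotonic fit and a basic PPI-style mean estimator; it is not the ppi_aipw API, and cross-fitting and variance estimation are omitted.

```python
# Minimal sketch (not the ppi_aipw package API): isotonic post-hoc calibration
# of a black-box prediction score on the labeled sample, followed by a simple
# PPI-style mean estimate. Cross-fitting and variance estimation are omitted.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrated_ppi_mean(y_lab, score_lab, score_unlab):
    """Estimate E[Y] from a small labeled sample and a large unlabeled sample."""
    # Step 1: post-hoc isotonic calibration, fitted on the labeled sample only.
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(score_lab, y_lab)
    f_lab = iso.predict(score_lab)
    f_unlab = iso.predict(score_unlab)
    # Step 2: PPI-style estimator: mean of calibrated predictions on the
    # unlabeled sample plus a bias correction from the labeled sample.
    return f_unlab.mean() + (y_lab - f_lab).mean()
```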


Calibrating conditional risk

Vasilyev, Andrey, Wang, Yikai, Li, Xiaocheng, Chen, Guanting

arXiv.org Machine Learning

We introduce and study the problem of calibrating conditional risk, which involves estimating the expected loss of a prediction model conditional on input features. We analyze this problem in both classification and regression settings and show that it is fundamentally equivalent to a standard regression task. For classification settings, we further establish a connection between conditional risk calibration and individual/conditional probability calibration, and develop theoretical insights for the performance metric. This reveals that while conditional risk calibration is related to existing uncertainty quantification problems, it remains a distinct and standalone machine learning problem. Empirically, we validate our theoretical findings and demonstrate the practical implications of conditional risk calibration in the learning to defer (L2D) framework. Our systematic experiments provide both qualitative and quantitative assessments, offering guidance for future research in uncertainty-aware decision-making.
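
Since the abstract reduces conditional risk calibration to a regression task, a hedged sketch is to regress the fixed classifier's per-example loss on the input features; the regressor and the cross-entropy loss below are illustrative choices, not the paper's.

```python
# Illustrative sketch: conditional risk calibration as regression of the
# per-example loss of a fixed classifier on the input features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_conditional_risk(X_cal, y_cal, predict_proba):
    """Return a model r(x) ~= E[loss | x] for a fixed black-box classifier."""
    proba = predict_proba(X_cal)                 # (n, n_classes) probabilities
    eps = 1e-12
    # Per-example cross-entropy loss of the fixed classifier (assumed loss).
    per_example_loss = -np.log(np.clip(proba[np.arange(len(y_cal)), y_cal], eps, None))
    risk_model = GradientBoostingRegressor()
    risk_model.fit(X_cal, per_example_loss)
    return risk_model                            # .predict(x) estimates conditional risk
```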


Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

Yadav, Akash, Adebiyi, Taiwo A., Zhang, Ruda

arXiv.org Machine Learning

Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and time-series forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
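
One plausible reading of the sampling mechanism, sketched below under that assumption, is to draw attention weights from a multinomial with the softmax probabilities and an integer concentration parameter c as the number of trials, then renormalize; larger c concentrates the samples around the ordinary softmax weights.

```python
# Hedged sketch of inference-time Stochastic Attention: replace softmax weights
# with normalized multinomial samples governed by one concentration parameter c.
# The paper's exact mechanism may differ; this is one plausible reading.
import numpy as np

def stochastic_attention_weights(scores, c, rng=None):
    """scores: (n_queries, n_keys) attention logits; c: integer concentration (trials)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)            # ordinary softmax weights
    counts = np.stack([rng.multinomial(c, row) for row in p])
    return counts / c                             # normalized multinomial sample

# Predictive ensemble: repeat the forward pass with fresh samples and aggregate
# the outputs (mean as the point forecast, spread as the uncertainty estimate).
```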


Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference

Arruda, Jonas, Chervet, Sophie, Staudt, Paula, Wieser, Andreas, Hoelscher, Michael, Sermet-Gaudelus, Isabelle, Binder, Nadine, Opatowski, Lulla, Hasenauer, Jan

arXiv.org Machine Learning

Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
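
The core idea of embedding the selection mechanism in the generative simulator can be illustrated with a toy prevalence example; the prevalence model and outcome-dependent inclusion probabilities below are hypothetical, not the paper's applications.

```python
# Toy sketch: a simulator with an outcome-dependent selection step baked in.
# A neural posterior estimator trained on (theta, selected-data) pairs drawn
# from this simulator sees the same selection mechanism as the observed data.
import numpy as np

def simulate_selected_sample(theta, n_population, rng):
    """theta: true prevalence; returns only the observations that pass selection."""
    y = rng.binomial(1, theta, size=n_population)      # latent outcomes
    # Selection depends on the (possibly unobserved) outcome: cases are more
    # likely to enter the dataset, which biases the naive prevalence estimate.
    p_select = np.where(y == 1, 0.8, 0.3)
    selected = rng.binomial(1, p_select).astype(bool)
    return y[selected]

rng = np.random.default_rng(0)
prior_draws = rng.uniform(0.0, 1.0, size=1000)
training_pairs = [(th, simulate_selected_sample(th, 500, rng)) for th in prior_draws]
# training_pairs would then be summarized and fed to an amortized posterior network.
```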


Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning

Müller, Adam T., Rögelein, Tobias, Stache, Nicolaj C.

arXiv.org Machine Learning

The deployment of deep neural networks in safety-critical systems necessitates reliable and efficient uncertainty quantification (UQ). A practical and widespread strategy for UQ is repurposing stochastic regularizers as scalable approximate Bayesian inference methods, such as Monte Carlo Dropout (MCD) and MC-DropBlock (MCDB). However, this paradigm remains under-explored for Stochastic Depth (SD), a regularizer integral to the residual-based backbones of most modern architectures. While prior work demonstrated its empirical promise for segmentation, a formal theoretical connection to Bayesian variational inference and a benchmark on complex, multi-task problems like object detection are missing. In this paper, we first provide theoretical insights connecting Monte Carlo Stochastic Depth (MCSD) to principled approximate variational inference. We then present the first comprehensive empirical benchmark of MCSD against MCD and MCDB on state-of-the-art detectors (YOLO, RT-DETR) using the COCO and COCO-O datasets. Our results position MCSD as a robust and computationally efficient method that achieves highly competitive predictive accuracy (mAP), notably yielding slight improvements in calibration (ECE) and uncertainty ranking (AUARC) compared to MCD. We thus establish MCSD as a theoretically-grounded and empirically-validated tool for efficient Bayesian approximation in modern deep learning.
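
As a concrete sketch, assuming a PyTorch residual backbone, MC Stochastic Depth amounts to leaving the stochastic-depth drop active at inference and aggregating repeated forward passes; the class and function names below are illustrative, not the paper's implementation.

```python
# Sketch of Monte Carlo Stochastic Depth (MCSD): keep the stochastic-depth drop
# active at evaluation time and average several forward passes. Standard
# stochastic depth instead keeps all blocks and rescales at inference.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block
        self.survival_prob = survival_prob

    def forward(self, x):
        # The residual branch is dropped with probability 1 - survival_prob,
        # even in eval mode, so that repeated passes differ (MC sampling).
        if torch.rand(1).item() < self.survival_prob:
            return x + self.block(x)
        return x

@torch.no_grad()
def mcsd_predict(model, x, n_samples=20):
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)     # predictive mean and variance
```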


CRPS-Optimal Binning for Univariate Conformal Regression

Toccaceli, Paolo

arXiv.org Machine Learning

We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimisation of within-sample LOO-CRPS turns out to be inappropriate for selecting $K$ as it results in in-sample optimism. We instead select $K$ by $K$-fold cross-validation of test CRPS, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. The conformal prediction is transductive and data-efficient, as all observations are used for both partitioning and p-value calculation, with no need to reserve a hold-out set. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, CQR-QRF, and conformalized isotonic distributional regression), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
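
The dynamic programme for the globally optimal $K$-partition can be sketched as below, assuming the closed-form bin costs have already been precomputed into a matrix; the cost computation itself (the LOO-CRPS of a within-bin empirical CDF) is not reproduced here.

```python
# Sketch of the O(n^2 K) dynamic programme for an optimal contiguous K-partition.
# Assumes cost is an (n+1) x (n+1) array where cost[i][j] is the (LOO-)CRPS cost
# of a bin containing the covariate-sorted observations i..j-1.
import numpy as np

def optimal_partition(cost, n, K):
    dp = np.full((K + 1, n + 1), np.inf)    # dp[k][j]: best cost of first j points in k bins
    back = np.zeros((K + 1, n + 1), dtype=int)
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):        # last bin spans observations i..j-1
                cand = dp[k - 1][i] + cost[i][j]
                if cand < dp[k][j]:
                    dp[k][j], back[k][j] = cand, i
    cuts, j = [], n                          # recover bin boundaries by backtracking
    for k in range(K, 0, -1):
        i = back[k][j]
        cuts.append((i, j))
        j = i
    return dp[K][n], cuts[::-1]
```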


Towards Accurate and Calibrated Classification: Regularizing Cross-Entropy From A Generative Perspective

Zhan, Qipeng, Zhou, Zhuoping, Shen, Li

arXiv.org Machine Learning

Accurate classification requires not only high predictive accuracy but also well-calibrated confidence estimates. Yet, modern deep neural networks (DNNs) are often overconfident, primarily due to overfitting on the negative log-likelihood (NLL). While focal loss variants alleviate this issue, they typically reduce accuracy, revealing a persistent trade-off between calibration and predictive performance. Motivated by the complementary strengths of generative and discriminative classifiers, we propose Generative Cross-Entropy (GCE), which maximizes $p(x|y)$ and is equivalent to cross-entropy augmented with a class-level confidence regularizer. Under mild conditions, GCE is strictly proper. Across CIFAR-10/100, Tiny-ImageNet, and a medical imaging benchmark, GCE improves both accuracy and calibration over cross-entropy, especially in the long-tailed scenario. Combined with adaptive piecewise temperature scaling (ATS), GCE attains calibration competitive with focal-loss variants without sacrificing accuracy.
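
The abstract gives only the high-level form (cross-entropy plus a class-level confidence regularizer), so the sketch below is schematic: the penalty on per-class average confidence is an illustrative stand-in, not the paper's GCE definition.

```python
# Schematic only: cross-entropy plus an illustrative class-level confidence
# penalty. The exact GCE regularizer is not specified in the abstract.
import torch
import torch.nn.functional as F

def ce_with_class_confidence_penalty(logits, targets, lam=0.1, num_classes=10):
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(targets, num_classes).float()
    # Average confidence assigned to the true class, computed separately per class.
    class_conf = (probs * onehot).sum(0) / onehot.sum(0).clamp(min=1.0)
    return ce + lam * class_conf.mean()
```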


Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

Franzen, Courtney, Pourkamali-Anaraki, Farhad

arXiv.org Machine Learning

Neural network classifiers trained with cross-entropy loss achieve strong predictive accuracy but lack the capability to provide inherent predictive uncertainty estimates, thus requiring external techniques to obtain these estimates. In addition, softmax scores for the true class can vary substantially across independent training runs, which limits the reliability of uncertainty-based decisions in downstream tasks. Evidential Deep Learning aims to address these limitations by producing uncertainty estimates in a single pass, but evidential training is highly sensitive to design choices including loss formulation, prior regularization, and activation functions. Therefore, this work introduces an alternative Dirichlet parameter estimation strategy by applying a method of moments estimator to ensembles of softmax outputs, with an optional maximum-likelihood refinement step. This ensemble-based construction decouples uncertainty estimation from the fragile evidential loss design while also mitigating the variability of single-run cross-entropy training, producing explicit Dirichlet predictive distributions. Across multiple datasets, we show that the improved stability and predictive uncertainty behavior of these ensemble-derived Dirichlet estimates translate into stronger performance in downstream uncertainty-guided applications such as prediction confidence scoring and selective classification.
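
A method-of-moments Dirichlet fit from ensemble softmax outputs can be sketched as below; averaging the per-class precision estimates is one common choice and may differ from the paper's estimator, and the optional maximum-likelihood refinement is omitted.

```python
# Sketch of a method-of-moments Dirichlet fit to an ensemble's softmax outputs
# for a single input (one Dirichlet per example). The precision estimate
# averages per-class moment ratios; the ML refinement step is omitted.
import numpy as np

def dirichlet_mom(softmax_samples):
    """softmax_samples: (M, C) softmax vectors from M ensemble members for one input."""
    m = softmax_samples.mean(axis=0)               # first moment per class
    m2 = (softmax_samples ** 2).mean(axis=0)       # second moment per class
    valid = (m2 - m ** 2) > 1e-12
    # Per-class precision s_k = (m_k - m2_k) / (m2_k - m_k^2), averaged over classes.
    s = np.mean((m[valid] - m2[valid]) / (m2[valid] - m[valid] ** 2))
    return np.maximum(m * s, 1e-6)                 # Dirichlet parameters alpha

# alpha.sum() acts as a precision (confidence) score for selective classification,
# while alpha / alpha.sum() recovers the ensemble-mean class probabilities.
```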


MEC: Machine-Learning-Assisted Generalized Entropy Calibration for Semi-Supervised Mean Estimation

Lee, Se Yoon, Kim, Jae Kwang

arXiv.org Machine Learning

Obtaining high-quality labels is costly, whereas unlabeled covariates are often abundant, motivating semi-supervised inference methods with reliable uncertainty quantification. Prediction-powered inference (PPI) leverages a machine-learning predictor trained on a small labeled sample to improve efficiency, but it can lose efficiency under model misspecification and suffer from coverage distortions due to label reuse. We introduce Machine-Learning-Assisted Generalized Entropy Calibration (MEC), a cross-fitted, calibration-weighted variant of PPI. MEC improves efficiency by reweighting labeled samples to better align with the target population, using a principled calibration framework based on Bregman projections. This yields robustness to affine transformations of the predictor and relaxes requirements for validity by replacing conditions on raw prediction error with weaker projection-error conditions. As a result, MEC attains the semiparametric efficiency bound under weaker assumptions than existing PPI variants. Across simulations and a real-data application, MEC achieves near-nominal coverage and tighter confidence intervals than CF-PPI and vanilla PPI.
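
The calibration-weighting idea can be sketched with an entropy-calibration (exponential-tilting) step that reweights the labeled sample so its means of chosen calibration variables match those of the unlabeled sample; this is one instance of a Bregman-type calibration, and MEC's exact objective and cross-fitting are not reproduced.

```python
# Hedged sketch of an entropy-calibration weighting step: solve the convex dual
# so that weighted labeled means of g(x) match their unlabeled-sample means.
# This illustrates the general calibration-weighting idea, not MEC itself.
import numpy as np
from scipy.optimize import minimize

def entropy_calibration_weights(g_lab, g_unlab):
    """g_lab: (n, d) calibration variables on labeled data; g_unlab: (N, d) on unlabeled."""
    target = g_unlab.mean(axis=0)
    centered = g_lab - target

    def dual(lam):                  # convex dual of the entropy-balancing problem
        return np.log(np.mean(np.exp(centered @ lam)))

    lam = minimize(dual, np.zeros(g_lab.shape[1]), method="BFGS").x
    w = np.exp(centered @ lam)
    return w / w.sum()              # weighted labeled means now match the target
```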


Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

Xiang, Haotian, Li, Bingcong, Lu, Qin

arXiv.org Machine Learning

When deploying large language models (LLMs) to safety-critical applications, uncertainty quantification (UQ) is of utmost importance to self-assess the reliability of the LLM-based decisions. However, such decisions typically suffer from overconfidence, particularly after parameter-efficient fine-tuning (PEFT) for downstream domain-specific tasks with limited data. Existing methods to alleviate this issue either rely on a Laplace-approximation-based post-hoc framework, which may yield suboptimal calibration depending on the training trajectory, or on variational Bayesian training that requires multiple complete forward passes through the entire LLM backbone at inference time for Monte Carlo estimation, posing scalability challenges for deployment. To address these limitations, we build on the Bayesian last layer (BLL) model, where the LLM-based deterministic feature extractor is followed by random last-layer parameters for uncertainty reasoning. Since existing low-rank adapters (LoRA) for PEFT have limited expressiveness due to rank collapse, we address this with Polar-decomposed Low-rank Adapter Representation (PoLAR), an orthogonalized parameterization paired with Riemannian optimization to enable more stable and expressive adaptation. Building on this PoLAR-BLL model, we leverage the variational (V) inference framework to put forth a scalable Bayesian fine-tuning approach which jointly seeks the PoLAR parameters and the approximate posterior of the last-layer parameters via alternating optimization. The resulting PoLAR-VBLL is a flexible framework that integrates architecture-enhanced optimization with scalable Bayesian inference to endow LLMs with well-calibrated UQ. Our empirical results verify the effectiveness of PoLAR-VBLL in terms of generalization and uncertainty estimation on both in-distribution and out-of-distribution data for various common-sense reasoning tasks.
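
The Bayesian-last-layer component can be sketched, under simplifying assumptions, as a mean-field Gaussian posterior over the last-layer weights trained with the reparameterization trick; the PoLAR orthogonalized adapter parameterization, the Riemannian optimization, and the alternating scheme are not reproduced here.

```python
# Sketch of a variational Bayesian last layer over frozen LLM features:
# mean-field Gaussian posterior on the weights, trained by maximizing an ELBO.
# The PoLAR adapters and alternating optimization from the paper are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLastLayer(nn.Module):
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_classes, feat_dim))
        self.log_sigma = nn.Parameter(torch.full((n_classes, feat_dim), -3.0))

    def forward(self, feats):
        # One reparameterized sample of the last-layer weights per call.
        w = self.mu + torch.exp(self.log_sigma) * torch.randn_like(self.mu)
        return feats @ w.t()

    def kl_to_standard_normal(self):
        var = torch.exp(2 * self.log_sigma)
        return 0.5 * (var + self.mu ** 2 - 1.0 - 2 * self.log_sigma).sum()

def elbo_loss(layer, feats, labels, beta=1e-3):
    return F.cross_entropy(layer(feats), labels) + beta * layer.kl_to_standard_normal()

# At inference, multiple cheap passes through the last layer give a Monte Carlo
# predictive distribution without re-running the frozen LLM backbone.
```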