Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research
Emmenegger, Nicolas, Stahler, Ellery, Podimata, Chara
Many applications require statistically valid inference across many related "tasks", while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited "ground-truth" labels, but commonly used methods treat tasks independently and therefore fail to exploit shared structure across related tasks. This limitation is especially important in settings where only a small number of labels are available per task. To address this issue, we introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. We complement our theoretical findings with experiments on synthetic and semi-synthetic datasets, as well as a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, we show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.
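For orientation, the single-task baseline that the abstract calls power-tuned PPI can be sketched for the simplest target, a population mean. This is a minimal illustration under stated assumptions, not the multi-task method of the paper: the function name `ppi_mean`, the plug-in tuning weight, and the normal-approximation interval are choices made here for illustration.

```python
import numpy as np

def ppi_mean(y_lab, f_lab, f_unlab, lam=None, z=1.96):
    """Prediction-powered estimate of E[Y] (illustrative sketch).

    y_lab   : ground-truth labels on the small labeled sample
    f_lab   : proxy values on the same labeled sample
    f_unlab : proxy values on the large unlabeled sample
    lam     : power-tuning weight; estimated from data when None
    """
    y_lab, f_lab, f_unlab = (np.asarray(a, dtype=float) for a in (y_lab, f_lab, f_unlab))
    n, N = len(y_lab), len(f_unlab)
    if lam is None:
        # Plug-in weight minimizing the variance expression used below.
        cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
        var_f = np.var(np.concatenate([f_lab, f_unlab]), ddof=1)
        lam = cov_yf / ((1.0 + n / N) * var_f)
    # Proxy mean on unlabeled data, rectified by the labeled-sample discrepancy.
    theta = y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean())
    # Var(Y - lam * f) / n + lam^2 * Var(f) / N
    var = np.var(y_lab - lam * f_lab, ddof=1) / n + lam**2 * np.var(f_unlab, ddof=1) / N
    half = z * np.sqrt(var)
    return theta, (theta - half, theta + half), lam
```

With `lam=0` the estimate reduces to the labeled-sample mean, and with a perfectly informative proxy the interval width is driven by the unlabeled sample size instead of the labeled one.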
Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift
Brima, Yusuf, Atemkeng, Marcellin, Kallon, Lansana Hassim, Niyukuri, David, Vacavant, Antoine, Saidu, Samuel, Chen, Ding-Geng
Background: Childhood anemia affects an estimated 40% of children aged 6-59 months globally and arises from heterogeneous nutritional, infectious, and socioeconomic factors that vary substantially across settings. This variability challenges the generalizability of predictive machine learning models, which often degrade under cross-population or temporal shifts. We investigated the utility of a modern transformer-based tabular foundation model (TabPFN) as a complementary framework to classical supervised machine learning methods across diverse country contexts, with particular attention to data-scarce settings where surveillance capacity is most limited. Methods: We conducted a multi-country prediction study using Demographic and Health Surveys (DHS) children's recode data from 16 countries spanning Africa, Asia, Latin America, the Caucasus, and the Middle East. The harmonized analytic cohort comprised n = 68,856 children aged 6-59 months with valid hemoglobin measurements. Anemia was defined using WHO age- and altitude-adjusted thresholds and treated as a binary outcome. We trained Logistic Regression, XGBoost, and LightGBM models using standard supervised learning, and evaluated TabPFN v2.6 in an in-context learning setting. Performance was assessed using Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and other standard classification metrics, with calibration evaluated via Brier score and expected calibration error (ECE). Uncertainty in performance estimates was quantified using bootstrap resampling to derive 95% confidence intervals. Robustness was assessed in a few-shot learning setting. Cross-population generalization was examined using leave-one-country-out (LOCO) validation and reverse-LOCO experiments to assess directional transferability. Subgroup analyses were conducted across five demographic strata: child age group, sex, maternal education, residence type, and household wealth quintile. Feature importance was assessed using standard linear and tree-based SHAP explainers for the three supervised models and an adapted version of SHAP for TabPFN, aggregated across countries and examined at the country level. Results: TabPFN yielded the best probabilistic calibration across all 16 countries, achieving the lowest mean Brier score (0.203) and expected calibration error (ECE = 0.042) of all models evaluated; LightGBM and Logistic Regression exhibited the greatest miscalibration, particularly at higher predicted probabilities. Under full-data conditions, within-country discrimination was moderate across all models (AUC-ROC 0.59-0.76). Under LOCO validation, performance declined modestly (AUC-ROC 0.58-0.69). Reverse-LOCO analyses revealed asymmetric and directional transferability, with epidemiologically diverse populations serving as more informative training sources and certain target populations remaining persistently difficult to predict regardless of model or training data.
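The two calibration metrics named in this abstract can be computed for a binary outcome as in the sketch below. The abstract does not state the binning scheme, so the ten equal-width bins over the predicted probability of the positive class are an assumption, and the function names are illustrative.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted mean gap between observed event rate and mean predicted probability per bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; p == 1.0 falls in the last bin.
    idx = np.digitize(p_pred, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pred[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in bin b
    return float(ece)
```

Lower is better for both metrics; the Brier score mixes calibration with sharpness, while this binned ECE measures calibration alone and depends on the number of bins chosen.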
Dimension-Uniform Discretization Analysis of Preconditioned Annealed Langevin Dynamics for Multimodal Gaussian Mixtures
Baldassari, Lorenzo, Garnier, Josselin, Solna, Knut, de Hoop, Maarten V.
Obtaining stable diffusion-based samplers in high- and infinite-dimensional settings is challenging because errors can accumulate across high-frequency coordinates and make the dynamics unstable under refinement of the finite-dimensional approximation of the underlying function-space problem. Discretization is a typical source of such errors, and preconditioning with a suitable spectral decay is one way to control their accumulation. In this paper, we study this problem for preconditioned annealed Langevin dynamics (ALD) applied to Gaussian mixtures. We first show that Euler-Maruyama (EM) discretization, by treating the stiff linear part of the annealed score with a forward Euler step, imposes a stability constraint coupling the preconditioner with the annealed covariance scale. Together with the conditions ensuring dimension-uniform control of the annealed dynamics, this constraint forces the initial smoothed law to remain uniformly close to the target across dimensions. We then consider an exponential-integrator scheme that integrates the stiff linear part of the annealed score exactly. Under explicit spectral summability conditions coupling the smoothing covariance, the component covariance spectra, and the preconditioner, we prove a dimension-uniform Kullback-Leibler (KL) bound for this scheme. This bound can be made arbitrarily small, uniformly in dimension, by allowing enough time for annealing and then refining the time mesh accordingly. Importantly, these conditions allow regimes in which the KL divergence between the target and the initial smoothed law diverges with dimension, showing that the restrictions imposed by EM are scheme-dependent rather than intrinsic to ALD.
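The contrast between the two schemes can be seen in a toy setting that is much simpler than the one studied in the paper: independent Ornstein-Uhlenbeck coordinates dX_j = -a_j X_j dt + sqrt(2) dW_j with stiffness a_j growing with the coordinate index. The spectrum a_j = j^2, the step size, and the function name `simulate` are assumptions chosen for illustration; there is no annealing, mixture, or preconditioner here.

```python
import numpy as np

def simulate(a, h, n_steps, scheme, rng):
    """Integrate dX = -a X dt + sqrt(2) dW coordinate-wise, starting from X = 1."""
    x = np.ones_like(a)
    for _ in range(n_steps):
        xi = rng.standard_normal(a.shape)
        if scheme == "em":
            # Forward Euler on the linear drift: mean-stable only if a * h < 2.
            x = (1.0 - a * h) * x + np.sqrt(2.0 * h) * xi
        elif scheme == "exp":
            # Exact integration of the linear part; exact in law for this SDE.
            decay = np.exp(-a * h)
            x = decay * x + np.sqrt((1.0 - decay**2) / a) * xi
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return x

a = np.arange(1, 51, dtype=float) ** 2  # hypothetical spectrum, stiffer at high frequency
h = 0.01
rng = np.random.default_rng(0)
x_em = simulate(a, h, 100, "em", rng)
x_exp = simulate(a, h, 100, "exp", rng)
print("max |x| Euler-Maruyama:      ", np.abs(x_em).max())
print("max |x| exponential integrator:", np.abs(x_exp).max())
```

In this toy, Euler-Maruyama diverges on every coordinate with a_j h > 2, so refining the spatial approximation (adding coordinates) forces a smaller step, whereas the exponential-integrator update stays bounded for any step size.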
Unsupervised learning of acquisition variability in structural connectomes via hybrid latent space modeling
Rudravaram, Gaurav, Zuo, Lianrui, Ramadass, Karthik, McMaster, Elyssa, Yoon, Jongyeon, Krishnan, Aravind R., Saunders, Adam M., Gao, Chenyu, Newlin, Nancy R., Kanakaraj, Praitayini, Beason-Held, Lori L., Bilgel, Murat, Barquero, Laura A., D'Archangel, Micah, Nguyen, Tin Q., Cutting, Laurie B., Archer, Derek, Hohman, Timothy J., Moyer, Daniel C., Landman, Bennett A.
Acquisition differences across sites, scanners, and protocols in dMRI introduce variability that complicates structural connectome analysis. This motivates deep learning models that can represent high-dimensional connectomes in a low-dimensional space while explicitly separating acquisition-related effects from biological variation. Conventional dimensionality reduction methods model all variance as continuous, so acquisition effects often get absorbed into a continuous latent space. Recent hybrid latent-space models combine discrete and continuous components to address this, but typically require manual capacity tuning to ensure the discrete component captures the intended variability. We introduce an unsupervised framework that removes this manual tuning by architecturally annealing encoder outputs before decoding, allowing the model to adaptively balance discrete and continuous latent variables during training. To evaluate it, we curated a dataset of N=7,416 structural connectomes derived from dMRI, spanning ages 2 to 102 and 13 studies with 25 unique acquisition-parameter combinations. Of these, 5,900 are cognitively unimpaired, 877 have mild cognitive impairment (MCI), and 639 have Alzheimer's disease (AD). We compare against a standard VAE, PCA with k-means clustering, and hybrid models that anneal only through the loss function. Our architectural annealing produces stronger site learning (ARI=0.53, p<0.05) than these baselines. Results show that a hybrid continuous-discrete latent space, with architectural rather than loss-based annealing, provides a useful unsupervised mechanism for capturing acquisition variability in dMRI: by jointly modeling smooth and categorical structure, the Joint-VAE recovers clusters aligned with scanner and protocol differences.
Generative Modeling of Approximately Periodic Time Series by a Posterior-Weighted Gaussian Process
Reich, Elias, Messineo, Saverio, Huber, Stefan
Discrete automated processes in industrial and cyber-physical systems often exhibit a repetitive structure in which successive repetitions follow a common trajectory while differing in duration, amplitude, and fine-scale dynamics. Such approximately periodic behavior poses a challenge for Gaussian process (GP) modeling: strictly periodic models suppress inter-repetition variability, while non-periodic models fail to capture the strong structural regularities required for generation. In this work, we propose a stochastic generative model for approximately periodic time series. The model is based on a GP whose posterior is modulated by a novel kernel. Our approach decouples intra-repetition structure from inter-repetition variability through a two-stage construction, which yields a generative distribution with an identical mean function across repetitions while allowing smooth variation between repetitions. The modeling choices are supported by an implementation in which realistic synthetic trajectories are generated from toy datasets.
Multiscale Euclidean Network Trajectories: Second-Moment Geometry, Attribution, and Change Points
A central challenge in dynamic network analysis is to represent temporal evolution in a way that is both geometrically meaningful and statistically identifiable. One approach embeds a sequence of network snapshots as trajectories in a Euclidean space and relates these trajectories to node embeddings. In multilayer and unfolded spectral constructions, however, node embeddings and their underlying latent positions are identifiable only up to general linear transformations. Although this ambiguity preserves edge probabilities, it can distort geometry and invalidate distance-based temporal comparisons at both the trajectory and node levels. We develop Multiscale Euclidean Network Trajectories (MENT), a framework for multiscale temporal trajectories based on second-moment geometry. By imposing an isotropic normalization on the anchor latent positions, we reduce the relevant ambiguity to orthogonal transformations and prevent distortion of the second-moment geometry. In this canonical representation, we define a trace variation distance and mode-wise variation distances along orthogonal directions, and use multidimensional scaling to obtain low-dimensional trajectories of time points at both global and mode-wise levels. The resulting trajectories support interpretation and inference. They admit mode-wise decompositions, support attribution of global and mode-wise temporal changes to nodes, and enable change point detection through 1D trajectories. We prove consistency of the proposed unfolded spectral embedding and of the induced temporal trajectories. Experiments on two synthetic and two real dynamic networks illustrate stable and interpretable recovery of temporal structure and show strong performance against existing change point detection baselines.
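The identifiability issue described above can be checked numerically. The sketch below uses random matrices as stand-ins for latent positions (so the bilinear form X Y^T plays the role of the edge-probability matrix without being constrained to [0, 1]); the specific matrices `G` and `Q` and the helper names are illustrative assumptions, and the paper's isotropic normalization itself is not reproduced.

```python
import numpy as np

def pairwise_dists(Z):
    """Euclidean distances between all pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def transform(X, Y, G):
    """Apply the ambiguity (X, Y) -> (X G, Y G^{-T}), which leaves X Y^T unchanged."""
    return X @ G, Y @ np.linalg.inv(G).T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # stand-in for left latent positions
Y = rng.normal(size=(6, 2))  # stand-in for right latent positions

G = np.array([[3.0, 1.0], [0.0, 0.5]])  # invertible but not orthogonal
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])  # orthogonal (a rotation)

XG, YG = transform(X, Y, G)
XQ, YQ = transform(X, Y, Q)

print("bilinear form preserved by G:", np.allclose(XG @ YG.T, X @ Y.T))
print("distances preserved by G:    ", np.allclose(pairwise_dists(XG), pairwise_dists(X)))
print("distances preserved by Q:    ", np.allclose(pairwise_dists(XQ), pairwise_dists(X)))
```

Both transformations leave X Y^T intact, but only the orthogonal one keeps pairwise distances, which is why reducing the ambiguity to orthogonal transformations makes distance-based comparisons meaningful.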
BGM-IV: an AI-powered Bayesian generative modeling approach for instrumental variable analysis
Instrumental-variable (IV) regression enables causal estimation under endogeneity, but modern IV problems often involve nonlinear structural effects and high-dimensional covariates. Existing nonlinear IV methods directly learn the causal relation in observed feature space or rely on learned representations within two-stage or moment-based procedures, which can struggle when the causal information is embedded in a high-dimensional representation. We propose BGM-IV, a latent Bayesian generative modeling approach that reframes nonlinear IV regression as posterior inference in a causally structured latent space. BGM-IV infers latent components that separately capture shared confounding structure, outcome-specific variation, treatment-specific variation, and covariate-only nuisance information. To account for endogeneity, BGM-IV replaces the confounded outcome likelihood with an IV-integrated pseudo-likelihood that averages over instrument-induced treatment values within the latent model. Across various benchmark datasets, BGM-IV remains competitive in the classical low-dimensional regime and performs best in high-dimensional covariate regimes. Together, these results show that structured latent generative modeling provides a principled and effective strategy for nonlinear IV estimation with rich covariates. The code of BGM-IV is available at https://github.com/liuq-lab/BGM-IV.
Information-geometric adaptive sampling for graph diffusion
Lu, Yuhui, Liu, Wenjing, Zhan, Kun
Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distributional evolution on complex manifolds. In this paper, we present an information-geometric framework that reinterprets the diffusion sampling trajectory as a parametric curve on a Riemannian manifold. Our key observation is that the Fisher-Rao metric provides a principled measure of the intrinsic distance. By analyzing this metric, we derive the Drift Variation Score (DVS), a geometry-aware indicator that quantifies the instantaneous rate of distributional change. Unlike prior heuristic-based adaptive samplers, our DVS solver enforces a constant informational speed on the statistical manifold, automatically maintaining a uniform rate of distributional change along the sampling trajectory. This equal arc-length strategy ensures that each discretization step contributes equally to the information speed. Theoretical analysis verifies that DVS characterizes the local stiffness of the sampling dynamics in the Fisher-Rao sense. Experimental results on molecule and social network generation show that DVS significantly improves structural fidelity and sampling efficiency. Code is at https://github.com/kunzhan/DVS
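The equal arc-length idea can be illustrated independently of the DVS itself, whose formula the abstract does not give. In the sketch below, `speed` is a hypothetical stand-in for any positive indicator of the instantaneous rate of distributional change; the function name, the trapezoid quadrature, and the grid size are assumptions made for illustration.

```python
import numpy as np

def equal_arclength_steps(speed, t0, t1, n_steps, n_grid=2001):
    """Choose n_steps + 1 time points so each step covers the same integral of `speed`.

    speed : vectorized callable returning a positive rate at each time (hypothetical stand-in)
    """
    grid = np.linspace(t0, t1, n_grid)
    v = np.asarray(speed(grid), dtype=float)
    # Cumulative arc length s(t) via the trapezoid rule on the fine grid.
    increments = 0.5 * (v[1:] + v[:-1]) * np.diff(grid)
    s = np.concatenate([[0.0], np.cumsum(increments)])
    # Equally spaced arc-length targets, mapped back to time by inverting s(t).
    targets = np.linspace(0.0, s[-1], n_steps + 1)
    return np.interp(targets, s, grid)

# Example: a rate that grows linearly in time concentrates steps at late times.
steps = equal_arclength_steps(lambda t: 2.0 * t + 1e-9, 0.0, 1.0, 4)
print(steps)
```

A constant rate recovers uniform time-stepping, so the uniform schedule is the special case in which the distribution changes at the same speed everywhere along the trajectory.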