Bayesian Inference
Individual Causal Inference with Structural Causal Model
Individual causal inference (ICI) uses causal inference methods to understand and predict the effects of interventions on individuals, considering their specific characteristics / facts. It aims to estimate individual causal effect (ICE), which varies across individuals. Estimating ICE can be challenging due to the limited data available for individuals, and the fact that most causal inference methods are population-based. Structural Causal Model (SCM) is fundamentally population-based. Therefore, causal discovery (structural learning and parameter learning), association queries and intervention queries are all naturally population-based. However, exogenous variables (U) in SCM can encode individual variations and thus provide the mechanism for individualized population per specific individual characteristics / facts. Based on this, we propose ICI with SCM as a "rung 3" causal inference, because it involves "imagining" what would be the causal effect of a hypothetical intervention on an individual, given the individual's observed characteristics / facts. Specifically, we propose the indiv-operator, indiv(W), to formalize/represent the population individualization process, and the individual causal query, P(Y | indiv(W), do(X), Z), to formalize/represent ICI. We show and argue that ICI with SCM is inference on individual alternatives (possible), not individual counterfactuals (non-actual).
HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics
Luettgau, Lennart, Coppock, Harry, Dubois, Magda, Summerfield, Christopher, Ududec, Cozmin
As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.
Large Population Models
Many of society's most pressing challenges, from pandemic response to supply chain disruptions to climate adaptation, emerge from the collective behavior of millions of autonomous agents making decisions over time. Large Population Models (LPMs) offer an approach to understand these complex systems by simulating entire populations with realistic behaviors and interactions at unprecedented scale. LPMs extend traditional modeling approaches through three key innovations: computational methods that efficiently simulate millions of agents simultaneously, mathematical frameworks that learn from diverse real-world data streams, and privacy-preserving communication protocols that bridge virtual and physical environments. This allows researchers to observe how agent behavior aggregates into system-level outcomes and test interventions before real-world implementation. While current AI advances primarily focus on creating "digital humans" with sophisticated individual capabilities, LPMs develop "digital societies" where the richness of interactions reveals emergent phenomena. By bridging individual agent behavior and population-scale dynamics, LPMs offer a complementary path in AI research illuminating collective intelligence and providing testing grounds for policies and social innovations before real-world deployment. We discuss the technical foundations and some open problems here. LPMs are implemented by the AgentTorch framework (github.com/AgentTorch/AgentTorch)
Learning to Control Dynamical Agents via Spiking Neural Networks and Metropolis-Hastings Sampling
Safa, Ali, Mohsen, Farida, Al-Zawqari, Ali
Spiking Neural Networks (SNNs) offer biologically inspired, energy-efficient alternatives to traditional Deep Neural Networks (DNNs) for real-time control systems. However, their training presents several challenges, particularly for reinforcement learning (RL) tasks, due to the non-differentiable nature of spike-based communication. In this work, we introduce what is, to our knowledge, the first framework that employs Metropolis-Hastings (MH) sampling, a Bayesian inference technique, to train SNNs for dynamical agent control in RL environments without relying on gradient-based methods. Our approach iteratively proposes and probabilistically accepts network parameter updates based on accumulated reward signals, effectively circumventing the limitations of backpropagation while enabling direct optimization on neuromorphic platforms. We evaluated this framework on two standard control benchmarks: AcroBot and CartPole. The results demonstrate that our MH-based approach outperforms conventional Deep Q-Learning (DQL) baselines and prior SNN-based RL approaches in terms of maximizing the accumulated reward while minimizing network resources and training episodes.
POIFormer: A Transformer-Based Framework for Accurate and Scalable Point-of-Interest Attribution
Saxena, Nripsuta Ani, Hsu, Shang-Ling, Shetty, Mehul, Alkhadra, Omar, Shahabi, Cyrus, Horn, Abigail L.
Accurately attributing user visits to specific Points of Interest (POIs) is a foundational task for mobility analytics, personalized services, marketing and urban planning. However, POI attribution remains challenging due to GPS inaccuracies, typically ranging from 2 to 20 meters in real-world settings, and the high spatial density of POIs in urban environments, where multiple venues can coexist within a small radius (e.g., over 50 POIs within a 100-meter radius in dense city centers). Relying on proximity is therefore often insufficient for determining which POI was actually visited. We introduce \textsf{POIFormer}, a novel Transformer-based framework for accurate and efficient POI attribution. Unlike prior approaches that rely on limited spatiotemporal, contextual, or behavioral features, \textsf{POIFormer} jointly models a rich set of signals, including spatial proximity, visit timing and duration, contextual features from POI semantics, and behavioral features from user mobility and aggregated crowd behavior patterns--using the Transformer's self-attention mechanism to jointly model complex interactions across these dimensions. By leveraging the Transformer to model a user's past and future visits (with the current visit masked) and incorporating crowd-level behavioral patterns through pre-computed KDEs, \textsf{POIFormer} enables accurate, efficient attribution in large, noisy mobility datasets. Its architecture supports generalization across diverse data sources and geographic contexts while avoiding reliance on hard-to-access or unavailable data layers, making it practical for real-world deployment. Extensive experiments on real-world mobility datasets demonstrate significant improvements over existing baselines, particularly in challenging real-world settings characterized by spatial noise and dense POI clustering.
Last Layer Hamiltonian Monte Carlo
Vellenga, Koen, Steinhauer, H. Joe, Falkman, Göran, Andersson, Jonas, Sjögren, Anders
We explore the use of Hamiltonian Monte Carlo (HMC) sampling as a probabilistic last layer approach for deep neural networks (DNNs). While HMC is widely regarded as a gold standard for uncertainty estimation, the computational demands limit its application to large-scale datasets and large DNN architectures. Although the predictions from the sampled DNN parameters can be parallelized, the computational cost still scales linearly with the number of samples (similar to an ensemble). Last layer HMC (LL--HMC) reduces the required computations by restricting the HMC sampling to the final layer of a DNN, making it applicable to more data-intensive scenarios with limited computational resources. In this paper, we compare LL-HMC against five last layer probabilistic deep learning (LL-PDL) methods across three real-world video datasets for driver action and intention. We evaluate the in-distribution classification performance, calibration, and out-of-distribution (OOD) detection. Due to the stochastic nature of the probabilistic evaluations, we performed five grid searches for different random seeds to avoid being reliant on a single initialization for the hyperparameter configurations. The results show that LL--HMC achieves competitive in-distribution classification and OOD detection performance. Additional sampled last layer parameters do not improve the classification performance, but can improve the OOD detection. Multiple chains or starting positions did not yield consistent improvements.
Uncertainty quantification of a multi-component Hall thruster model at varying facility pressures
Marks, Thomas A., Eckels, Joshua D., Mora, Gabriel E., Gorodetsky, Alex A.
Bayesian inference is applied to calibrate and quantify prediction uncertainty in a coupled multi-component Hall thruster model at varying facility background pressures. The model, consisting of a cathode model, discharge model, and plume model, is used to simulate two thrusters across a range of background pressures in multiple vacuum test facilities. The model outputs include thruster performance metrics, one-dimensional plasma properties, and the angular distribution of the current density in the plume. The simulated thrusters include a magnetically shielded thruster, the H9, and an unshielded thruster, the SPT-100. After calibration, the model captures several key performance trends with background pressure, including changes in thrust and upstream shifts in the ion acceleration region. Furthermore, the model exhibits predictive accuracy to within 10\% when evaluated on flow rates and pressures not included in the training data, and the model can predict some performance characteristics across test facilities to within the same range. Evaluated on the same data as prior work [Eckels et al. 2024], the model reduced predictive errors in thrust and discharge current by greater than 50%. An extrapolation to on-orbit performance is performed with an error of 9\%, capturing trends in discharge current but not thrust. Possible extensions and improvements are discussed in the context of using data for predictive Hall thruster modeling across vacuum facilities.
CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk
Azizi, Ilia, Bodik, Juraj, Heiss, Jakob, Yu, Bin
Accurate uncertainty quantification is critical for reliable predictive modeling, especially in regression tasks. Existing methods typically address either aleatoric uncertainty from measurement noise or epistemic uncertainty from limited data, but not necessarily both in a balanced way. We propose CLEAR, a calibration method with two distinct parameters, $γ_1$ and $γ_2$, to combine the two uncertainty components for improved conditional coverage. CLEAR is compatible with any pair of aleatoric and epistemic estimators; we show how it can be used with (i) quantile regression for aleatoric uncertainty and (ii) ensembles drawn from the Predictability-Computability-Stability (PCS) framework for epistemic uncertainty. Across 17 diverse real-world datasets, CLEAR achieves an average improvement of 28.2% and 17.4% in the interval width compared to the two individually calibrated baselines while maintaining nominal coverage. This improvement can be particularly evident in scenarios dominated by either high epistemic or high aleatoric uncertainty.
Galerkin-ARIMA: A Two-Stage Polynomial Regression Framework for Fast Rolling One-Step-Ahead Forecasting
We introduce Galerkin-ARIMA, a novel time-series forecasting framework that integrates Galerkin projection techniques with the classical ARIMA model to capture potentially nonlinear dependencies in lagged observations. By replacing the fixed linear autoregressive component with a spline-based basis expansion, Galerkin-ARIMA flexibly approximates the underlying relationship among past values via ordinary least squares, while retaining the moving-average structure and Gaussian innovation assumptions of ARIMA. We derive closed-form solutions for both the AR and MA components using two-stage Galerkin projections, establish conditions for asymptotic unbiasedness and consistency, and analyze the bias-variance trade-off under basis-size growth. Complexity analysis reveals that, for moderate basis dimensions, our approach can substantially reduce computational cost compared to maximum-likelihood ARIMA estimation. Through extensive simulations on four synthetic processes-including noisy ARMA, seasonal, trend-AR, and nonlinear recursion series-we demonstrate that Galerkin-ARIMA matches or closely approximates ARIMA's forecasting accuracy while achieving orders-of-magnitude speedups in rolling forecasting tasks. These results suggest that Galerkin-ARIMA offers a powerful, efficient alternative for modeling complex time series dynamics in high-volume or real-time applications.
Mallows Model with Learned Distance Metrics: Sampling and Maximum Likelihood Estimation
Alimohammadi, Yeganeh, Asgari, Kiana
\textit{Mallows model} is a widely-used probabilistic framework for learning from ranking data, with applications ranging from recommendation systems and voting to aligning language models with human preferences~\cite{chen2024mallows, kleinberg2021algorithmic, rafailov2024direct}. Under this model, observed rankings are noisy perturbations of a central ranking $σ$, with likelihood decaying exponentially in distance from $σ$, i.e, $P (π) \propto \exp\big(-β\cdot d(π, σ)\big),$ where $β> 0$ controls dispersion and $d$ is a distance function. Existing methods mainly focus on fixed distances (such as Kendall's $τ$ distance), with no principled approach to learning the distance metric directly from data. In practice, however, rankings naturally vary by context; for instance, in some sports we regularly see long-range swaps (a low-rank team beating a high-rank one), while in others such events are rare. Motivated by this, we propose a generalization of Mallows model that learns the distance metric directly from data. Specifically, we focus on $L_α$ distances: $d_α(π,σ):=\sum_{i=1} |π(i)-σ(i)|^α$. For any $α\geq 1$ and $β>0$, we develop a Fully Polynomial-Time Approximation Scheme (FPTAS) to efficiently generate samples that are $ε$- close (in total variation distance) to the true distribution. Even in the special cases of $L_1$ and $L_2$, this generalizes prior results that required vanishing dispersion ($β\to0$). Using this sampling algorithm, we propose an efficient Maximum Likelihood Estimation (MLE) algorithm that jointly estimates the central ranking, the dispersion parameter, and the optimal distance metric. We prove strong consistency results for our estimators (for any values of $α$ and $β$), and we validate our approach empirically using datasets from sports rankings.