Statistical Learning
Confounder Detection via Treatment Intent: A New Observational Study Design
Plecko, Drago, Okanovic, Patrik, Hoefler, Torsten, Bareinboim, Elias
Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and often constrained by ethical or practical limitations, motivating the need for causal methods able to draw conclusions from observational data. While such data is collected at ever larger scale, making its use for causal inference is often hindered by the fact that not all variables affecting treatment allocation and the outcome are observed - an issue known as unobserved confounding. In this paper, we introduce a new study design called confounder detection via treatment intent. The idea is to query a human expert who makes treatment decisions, and ask them to compare pairs of units proposed by a principled matching strategy, with the goal of eliciting unobserved variables that explain why treatment decisions differ. We provide a theoretical basis for such a procedure, ascertaining conditions under which such a study design may elicit unobserved confounders. Building on this newly established foundations, we study treatment effects of interventions in the intensive care unit (ICU). First, we show empirical evidence strongly indicating that electronic health records (EHRs) collected in ICUs are subject to unobserved confounding. By using clinical text notes as a proxy for physicians' knowledge and leveraging natural language processing, we provide a proof of concept for our methodology in a semi-synthetic environment with a known ground truth.
Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing
Sun, Rongyi, Sun, Wenguang, Zhao, Zinan
This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as spatiotemporal or grouping structures. To overcome this limitation, we propose the structure-adaptive conformal q-value (SCQ), a significance index that integrates individual test evidence with structural patterns. We also develop pseudo-score-guided transductive automated model selection (P-TAMS), which adapts conformalized model selection to structured OOD testing across a toolbox of candidate models. Together, SCQ and P-TAMS form a unified framework under pairwise exchangeability, providing finite-sample error-rate control, improved power, and enhanced interpretability. Experiments on simulated and real data demonstrate that the proposed approach controls the false discovery rate and performs well across diverse settings.
Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift
Brima, Yusuf, Atemkeng, Marcellin, Kallon, Lansana Hassim, Niyukuri, David, Vacavant, Antoine, Saidu, Samuel, Chen, Ding-Geng
Background Childhood Anemia affects an estimated 40% of children aged 6-59 months globally and arises from heterogeneous nutritional, infectious, and socioeconomic factors that vary substantially across settings. This variability challenges the generalizability of predictive machine learning models, which often degrade under cross-population or temporal shifts. We investigated the utility a modern transformer-based tabular foundation model (TabPFN) as a complementatry framework with respect to supervised classical machine learning methods across diverse country contexts, with particular attention to data-scarce settings where surveillance capacity is most limited. Methods We conducted a multi-country prediction study using Demographic and Health Surveys (DHS) children's recode data from 16 countries spanning Africa, Asia, Latin America, the Caucasus, and the Middle East. The harmonized analytic cohort comprised of (n = 68,856)children aged 6-59 months with valid hemoglobin measurements. Anemia was defined using WHO age and altitude-adjusted thresholds and treated as a binary outcome. We trained Logistic Regression, XGBoost, and LightGBM models using standard supervised learning, and evaluated TabPFN v2.6 in an in-context learning setting. Performance was assessed using Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and other standard classification metrics, with calibration evaluated via Brier score and expected calibration error (ECE). Uncertainty in performance estimates was quantified using bootstrap resampling to derive 95% confidence intervals. Robustness was assessed in a few-shot learning setting. Cross-population generalization was examined using leave-one-country-out (LOCO) validation and reverse-LOCO experiments to assess directional transferability. Subgroup analyses were conducted across five demographic strata: child age group, sex, maternal education, residence type, and household wealth quintile. Feature importance was assessed using standard linear and tree-based explainer SHAP values for the three supervised models and an adapted version of SHAP for TabPFN, aggregated across countries and examined at the country level. TabPFN also yielded the best probabilistic calibration across all 16 countries, achieving the lowest mean Brier score (0.203) and Expected Calibration Error (ECE = 0.042) of all models evaluated; LightGBM and Logistic Regression exhibited the greatest miscalibration, particularly at higher predicted probabilities. Under full-data conditions, within-country discrimination was moderate across all models (AUC-ROC 0.59-0.76) Under LOCO validation, performance declined modestly (AUC-ROC 0.58-0.69) Reverse-LOCO analyses revealed asymmetric and directional transferability, with epidemiologically diverse populations serving as more informative training sources and certain target populations remaining persistently difficult to predict regardless of model or training data.
Sample Complexity of Policy Gradient for Log-Growth Control
Pan, Qiuhua, Shen, Yukai, Zhang, Liwei, Chen, Cailian, Guan, Xinping
We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/η)$ when the noise density is known and $\tilde{O}(η^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.
Bilevel Optimization over Saddle Points of Zero-Sum Markov Games
Zheng, Zihao, King, Irwin, Lu, Songtao
Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
CART Random Forests as Sequential Allocation over Random Opportunity Sets: A Stochastic-Control Theory of Ensemble Risk
Mei, Tianxing, Fan, Yingying, Leng, Mingming, Lv, Jinchi
CART random forests are among the most widely used modern predictive methods, with well-documented empirical success. Yet, at the mechanistic level, the algorithm is often treated as a black box because of its complexity. In this paper, we develop a stochastic-control perspective on feature-subsampled CART random forests, named CART random opportunity-set allocation (CART-ROSA). At each node, the random subset of features is interpreted as a random feasible action set, and the CART split rule as a masked-action allocation policy. This policy induces a controlled stochastic process over informative split-count states, whose terminal law determines both single-tree error and cross-tree interaction terms in the forest mean squared error (MSE). Such representation opens the black box of CART-forests by separating two design levers: the informative-opportunity rate induced by feature subsampling, and the contraction strength from the within-mask split policy. We establish that the CART policy is locally stabilizing: it contracts imbalances in informative split allocations and concentrates terminal tree geometry. At the system level, however, it can be globally suboptimal for the forest objective. Specializing to the linear model, we derive the MSE risk expansion explicitly. Our results show how an operations-research perspective makes tractable a theoretical gap difficult to access from the standard algorithmic description of CART forests.
Transformers Can Learn Posterior Predictive Distributions In-Context
Kang, Gyeonghun, Lee, Changwoo J., Cheng, Xiang
Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.
Nonlinear and Heavy-Tailed Predictability in Transition-Energy Financial Markets
Gnandi, Kpante Emmanuel, Pokou, Fredy, Kamdem, Jules Sadefo
Transition-related financial markets are increasingly exposed to abrupt repricing episodes, elevated volatility, and heterogeneous macro-financial shocks. Under such conditions, conventional Gaussian-linear forecasting frameworks may provide an incomplete representation of the dependence structure linking fossil-energy, renewable-energy, technology, and utility-sector assets. This paper investigates whether transition-related financial returns exhibit residual non-linear predictability after controlling for heavy-tailed multivariate linear dynamics. To address this question, we develop a hybrid forecasting framework combining Student-t Vector Autoregressions with nonlinear recurrent residual learning architectures. The empirical analysis considers six major exchange-traded funds representing broad equity markets and key transition-sensitive sectors. The results reveal substantial departures from Gaussian-linear behavior, including excess kurtosis, volatility clustering, and remaining nonlinear dependence after econometric filtering. Out-of-sample forecasting experiments show that the proposed framework consistently improves predictive accuracy relative to conventional VAR models, standalone machine-learning methods, and alternative hybrid specifications. The forecasting gains become more pronounced during periods of macro-financial stress, particularly during the COVID-19 crisis and the Ukraine-related energy shock. Overall, the findings suggest that transition-related financial systems exhibit regime-sensitive and heavy-tailed predictive dynamics that are insufficiently captured by standard Gaussian-linear models alone.
Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates
Takemura, Kei, Matsuno, Ryuta, Sakuma, Keita
Maintaining predictive accuracy in non-stationary environments requires online model selection to adapt autonomously to unknown distribution shifts. However, existing tuning-free algorithms face a fundamental trade-off between robustness and agility. Specifically, to ensure dynamic regret bounds, they must restrict learning rates to small constants (e.g., $O(1)$). This restriction inevitably causes significant adaptation lag during abrupt changes. To resolve this, we propose a novel optimistic online mirror descent that utilizes safeguarded large learning rates up to $Θ(T)$, where $T$ is the number of rounds. Our key technical contribution is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates incurring excessive regret, eliminating the need for restrictive a priori constraints. We show that the cumulative penalty remains $O(\log T)$, allowing our algorithm to match near-optimal worst-case guarantees while achieving superior rates in benign cases. Empirical evaluations on synthetic and eleven diverse real-world datasets demonstrate that our approach reduces the adaptation lag from hundreds of rounds to a few rounds, consistently outperforming tuning-free baselines.
Constrained Bayesian Experimental Design via Online Planning
Guo, Yujia, Huang, Daolang, Zhang, Xinyu, Katt, Sammie, Kaski, Samuel, Bharti, Ayush
Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.