subgroup
SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction
Lundström-Imanov, Gustav Olaf Yunus Laitinen-Fredriksson, Cömert, Hafize Gonca
Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.
Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions
Lin, Gefei, Miao, Rui, Sacheck, Jennifer, Zhang, Xiaoke
Physical activity (PA) plays an important role in maintaining and improving health. Daily steps have been a key PA measure that is easily accessible with common wearable devices. However, methods are lacking to recommend a personalized optimal distribution of daily steps over a period of time for the best of certain health biomarkers. In this paper, we fill this void based on the data from the All of Us Research Program which includes months of step counts as well as repeated measurements of key health biomarkers. We develop a new offline reinforcement learning (RL) algorithm to learn personalized and optimal PA distributions associated with cardiometabolic risk, where the action is a function representing the daily step distribution over a period of time. Simulation studies demonstrate the advantage of the proposed approach over existing continuous-action RL methods. The learned optimal policy from the All of Us data generally suggests people take more daily steps and also follow a more consistent pattern of PA over time while offering tailored recommendations for subgroups in blood glucose level, body mass index, blood pressure, age, and sex.
Improving the Efficiency of Subgroup Analysis in Randomized Controlled Trials with TMLE
Qiu, Sky, Nance, Nerissa, Phillips, Rachael, Tarp, Jens, Petersen, Maya, van der Laan, Mark
Subgroup analyses within randomized controlled trials are often underpowered due to limited sample sizes. We address this challenge by leveraging trial participants outside the subgroup of interest to augment estimation within the subgroup. Specifically, we study two Targeted Maximum Likelihood Estimators (TMLEs) that borrow information from non-subgroup participants within the same trial: a TMLE with pooled regression (TMLE-PR) and an Adaptive Targeted Maximum Likelihood Estimator (A-TMLE). Both estimators enable information sharing without relying on any external real-world data, thereby capitalizing on key strengths of the trial: most importantly, the protection against bias afforded by the randomized treatment, but also harmonized data collection, and consistent treatment and outcome definitions. The general strategy proposed here directly advances the priorities of key regulatory agencies, including the FDA, by improving the precision of subgroup-specific treatment effect estimates without introducing external sources of bias, thereby facilitating rigorous inference to support equitable labeling, access, and post-market evaluation. In a case study based on analysis of data from a cardiovascular outcome trial (LEADER, NCT01179048), we estimate the risk reduction of major adverse cardiac events (MACE) under liraglutide treatment among Black and Asian subgroups -- each comprising less than 10\% of the trial population -- using the proposed estimators that borrow information from the remainder of the trial. Using A-TMLE, in particular, we find estimated absolute MACE risk reductions of 1.6, 1.5, and 1.5 percentage points among Asian participants and 2.1, 2.0, and 2.1 percentage points among Black participants at 365, 540, and 730 days, respectively, with 95\% confidence intervals excluding the null at each time point.
Fit CATE Once: Model-Assisted Randomization Tests Without Sample Splitting
Randomization tests and flexible treatment-effect models offer complementary strengths for analyzing data from randomized panel experiments: the former provide valid inference under the known assignment mechanism, while the latter can capture complex patterns of effect heterogeneity. We develop model-assisted randomization tests that combine these strengths without sample splitting. The key idea is to estimate an unsigned version of the conditional average treatment effect (CATE) from the covariance structure of residualized outcomes, while leaving the realized assignments for randomization inference. The remaining sign can be chosen to best fit the observed outcomes. We establish identification and consistency for the proposed unsigned CATE estimators, as well as validity for the CATE-assisted randomization tests. Across synthetic and semi-synthetic experiments, the CATE-assisted randomization tests control Type I error and achieve higher power than covariate-adjusted and sample-split alternatives. Finally, we show that the assignment-free CATE estimates can be used to discover heterogeneous subgroups and test subgroup-specific treatment effects.
Adaptive auditing of AI systems with anytime-valid guarantees
Zhou, Siyu, Vossler, Patrick, Sivaraman, Venkatesh, Mai, Yifan, Feng, Jean
A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.
On Learning Fairness and Accuracy on Multiple Subgroups
We propose an analysis in fair learning that preserves the utility of the data while reducing prediction disparities under the criteria of group sufficiency. We focus on the scenario where the data contains multiple or even many subgroups, each with limited number of samples. As a result, we present a principled method for learning a fair predictor for all subgroups via formulating it as a bilevel objective. In the lower-level, the subgroup-specific predictors are learned through a small amount of data and the fair predictor. In the upper-level, the fair predictor is updated to be close to all subgroup specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and generalization error. We evaluate the proposed framework on real-world datasets. Empirical evidence suggests the consistently improved fair predictions, as well as the comparable accuracy to the baselines.
Fair Canonical Correlation Analysis
This paper investigates fairness and bias in Canonical Correlation Analysis (CCA), a widely used statistical technique for examining the relationship between two sets of variables. We present a framework that alleviates unfairness by minimizing the correlation disparity error associated with protected attributes. Our approach enables CCA to learn global projection matrices from all data points while ensuring that these matrices yield comparable correlation levels to group-specific projection matrices. Experimental evaluation on both synthetic and real-world datasets demonstrates the efficacy of our method in reducing correlation disparity error without compromising CCA accuracy.