semiparametric efficiency
Adaptive Experimentation for Censored Survival Outcomes
Wang, Yuxin, Frauen, Dennis, Schweisthal, Jonas, Schröder, Maresa, Javurek, Emil, Feuerriegel, Stefan
Adaptive experimentation enables efficient estimation of causal effects, but existing methods are not designed for survival data with censoring, where event times are only partially observed (e.g., overall survival in cancer trials but with dropout). In this paper, we develop a novel framework for adaptive experimentation to estimate causal effects under right censoring. For this, we derive the semiparametric efficiency bound for the average survival effect curve as a function of the treatment allocation policy and thereby obtain a closed-form efficiency-optimal allocation policy. The policy generalizes classical Neyman allocation to survival settings by prioritizing patient strata where both event and censoring dynamics induce high uncertainty. Building on this, we propose the Adaptive Survival Estimator (ASE), an adaptive framework that learns the allocation policy and estimates the average survival effect curve sequentially. Our framework has three main benefits: (i) it accommodates arbitrary machine learning models for nuisance estimation; (ii) it is guided by a closed-form efficiency-optimal allocation policy; and (iii) it admits strong theoretical guarantees, including asymptotic normality via a martingale central limit theorem. We demonstrate our framework across various numerical experiments to show consistent efficiency gains over uniform randomization and censoring-agnostic baselines.
Augmented transfer regression learning for completely missing covariates
Large-scale population-level datasets, such as the UK Biobank and the All of Us Research Program, often lack covariates needed for a specific analysis, such as genetic or lifestyle measures, while related studies measure them. This creates a cross-population missing data problem in which covariates are completely unobserved in the target population, rather than partially missing within one dataset. We propose an augmented transfer regression learning method for this setting. The key identifying condition is a sub-population shift assumption: the joint distribution of the outcome and observed covariates may differ across source and target populations, but the conditional distribution of the missing covariates given observed variables is invariant. We combine importance-weighted estimating equations with imputation terms for first- and second-order moments of the missing covariates. The resulting estimator is doubly robust, remaining consistent if either the density ratio model or both imputation models are correctly specified. It is $n^{1/2}$-consistent and asymptotically normal, and attains the semiparametric efficiency bound when both nuisance models are correctly specified.
Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals
Dong, Zihan, Zhang, Zhixian, Zhou, Yang, Jin, Can, Wu, Ruijia, Zhang, Linjun
Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is pretty unstable.
Beyond Demand Estimation: Consumer Surplus Evaluation via Cumulative Propensity Weights
Bian, Zeyu, Biggs, Max, Gao, Ruijiang, Qi, Zhengling
This paper develops a practical framework for using observational data to audit the consumer surplus effects of AI-driven decisions, specifically in targeted pricing and algorithmic lending. Traditional approaches first estimate demand functions and then integrate to compute consumer surplus, but these methods can be challenging to implement in practice due to model misspecification in parametric demand forms and the large data requirements and slow convergence of flexible nonparametric or machine learning approaches. Instead, we exploit the randomness inherent in modern algorithmic pricing, arising from the need to balance exploration and exploitation, and introduce an estimator that avoids explicit estimation and numerical integration of the demand function. Each observed purchase outcome at a randomized price is an unbiased estimate of demand and by carefully reweighting purchase outcomes using novel cumulative propensity weights (CPW), we are able to reconstruct the integral. Building on this idea, we introduce a doubly robust variant named the augmented cumulative propensity weighting (ACPW) estimator that only requires one of either the demand model or the historical pricing policy distribution to be correctly specified. Furthermore, this approach facilitates the use of flexible machine learning methods for estimating consumer surplus, since it achieves fast convergence rates by incorporating an estimate of demand, even when the machine learning estimate has slower convergence rates. Neither of these estimators is a standard application of off-policy evaluation techniques as the target estimand, consumer surplus, is unobserved. To address fairness, we extend this framework to an inequality-aware surplus measure, allowing regulators and firms to quantify the profit-equity trade-off. Finally, we validate our methods through comprehensive numerical studies.
Beyond the Average: Distributional Causal Inference under Imperfect Compliance
Byambadalai, Undral, Hirata, Tomu, Oka, Tatsushi, Yasui, Shota
We study the estimation of distributional treatment effects in randomized experiments with imperfect compliance. When participants do not adhere to their assigned treatments, we leverage treatment assignment as an instrumental variable to identify the local distributional treatment effect-the difference in outcome distributions between treatment and control groups for the subpopulation of compliers. We propose a regression-adjusted estimator based on a distribution regression framework with Neyman-orthogonal moment conditions, enabling robustness and flexibility with high-dimensional covariates. Our approach accommodates continuous, discrete, and mixed discrete-continuous outcomes, and applies under a broad class of covariate-adaptive randomization schemes, including stratified block designs and simple random sampling. We derive the estimator's asymptotic distribution and show that it achieves the semiparametric efficiency bound. Simulation results demonstrate favorable finite-sample performance, and we demonstrate the method's practical relevance in an application to the Oregon Health Insurance Experiment.
Characterization of Efficient Influence Function for Off-Policy Evaluation Under Optimal Policies
Reinforcement learning (RL) focusing on developing optimal policies for sequential decision-making to maximize long-term rewards, (Sutton & Barto, 2018) have been serving as more and more important frontier in various fields. A critical component of RL is off-policy evaluation (OPE), which estimates the mean reward of a policy, termed the evaluation policy, using data collected under another policy, known as the behavior policy. OPE is essential in offline RL, where only historical datasets are available, precluding new experiments (Luedtke & V an Der Laan, 2016; Agarwal et al., 2019; Uehara et al., 2022). Recent years have witnessed substantial progress in developing statistically efficient OPE methods, with various approaches demonstrating semiparametric efficiency under different model settings (Jiang & Li, 2016; Kallus & Uehara, 2020; Shi et al., 2021). However, all of these existing analyses focus on scenarios where the evaluation policy is fixed and predetermined. A more challenging yet practical scenario arises when the evaluation policy itself is estimated from data, particularly when this policy is designed to be optimal with respect to some criterion. In this context, the statistical properties of OPE become more complex due to the additional estimation uncertainty introduced by the policy optimization process. In contrast, in the causal inference literature, such phenomena have been studied extensively in the optimal treatment regime literature (Laber et al., 2014; Kosorok & Laber, 2019; Athey & Wager, 2021). These works have established important results regarding the estimation of value functions under optimal treatment rules, but their direct application to the sequential decision-making context of RL presents additional challenges due to the temporal dependencies and potentially infinite horizons involved.
Efficient Adaptive Experimentation with Non-Compliance
Oprescu, Miruna, Cho, Brian M, Kallus, Nathan
We study the problem of estimating the average treatment effect (ATE) in adaptive experiments where treatment can only be encouraged--rather than directly assigned--via a binary instrumental variable. Building on semiparametric efficiency theory, we derive the efficiency bound for ATE estimation under arbitrary, history-dependent instrument-assignment policies, and show it is minimized by a variance-aware allocation rule that balances outcome noise and compliance variability. Leveraging this insight, we introduce AMRIV--an \textbf{A}daptive, \textbf{M}ultiply-\textbf{R}obust estimator for \textbf{I}nstrumental-\textbf{V}ariable settings with variance-optimal assignment. AMRIV pairs (i) an online policy that adaptively approximates the optimal allocation with (ii) a sequential, influence-function-based estimator that attains the semiparametric efficiency bound while retaining multiply-robust consistency. We establish asymptotic normality, explicit convergence rates, and anytime-valid asymptotic confidence sequences that enable sequential inference. Finally, we demonstrate the practical effectiveness of our approach through empirical studies, showing that adaptive instrument assignment, when combined with the AMRIV estimator, yields improved efficiency and robustness compared to existing baselines.
Optimal Policy Adaptation under Covariate Shift
Liu, Xueqing, Yang, Qinwei, Tian, Zhaoqing, Guo, Ruocheng, Wu, Peng
Transfer learning of prediction models has been extensively studied, while the corresponding policy learning approaches are rarely discussed. In this paper, we propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets: one with full information from the source domain and the other from the target domain with only covariates. First, under the setting of covariate shift, we formulate the problem from a perspective of causality and present the identifiability assumptions for the reward induced by a given policy. Then, we derive the efficient influence function and the semiparametric efficiency bound for the reward. Based on this, we construct a doubly robust and semiparametric efficient estimator for the reward and then learn the optimal policy by optimizing the estimated reward. Moreover, we theoretically analyze the bias and the generalization error bound for the learned policy. Furthermore, in the presence of both covariate and concept shifts, we propose a novel sensitivity analysis method to evaluate the robustness of the proposed policy learning approach. Extensive experiments demonstrate that the approach not only estimates the reward more accurately but also yields a policy that closely approximates the theoretically optimal policy.