Cross-Balancing for Data-Informed Design and Efficient Analysis of Observational Studies

Jin, Ying, Zubizarreta, José

arXiv.org Machine Learning

Causal inference starts with a simple idea: compare groups that differ by treatment, not much else. Traditionally, similar groups are constructed using only observed covariates; however, it remains a long-standing challenge to incorporate available outcome data into the study design while preserving valid inference. In this paper, we study the general problem of covariate adjustment, effect estimation, and statistical inference when balancing features are constructed or selected with the aid of outcome information from the data. We propose cross-balancing, a method that uses sample splitting to separate the error in feature construction from the error in weight estimation. Our framework addresses two cases: one where the features are learned functions and one where they are selected from a potentially high-dimensional dictionary. In both cases, we establish mild and general conditions under which cross-balancing produces consistent, asymptotically normal, and efficient estimators. In the learned-function case, cross-balancing achieves finite-sample bias reduction relative to plug-in-type estimators, and is multiply robust when the learned features converge at slow rates. In the variable-selection case, cross-balancing only requires a product condition on how well the selected variables approximate true functions. We illustrate cross-balancing in extensive simulations and an observational study, showing that careful use of outcome information can substantially improve both estimation and inference while maintaining interpretability.
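The sample-splitting idea can be illustrated with a minimal numpy sketch. This is not the authors' estimator: the outcome model, the feature map, the linear calibration solver, and the simulated data are all illustrative assumptions. One fold fits an outcome regression on control units, and its predictions become a learned balancing feature; the other fold computes calibration weights that exactly balance the raw covariates plus that learned feature between treated and control; the folds are then swapped and the two ATT estimates averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibration_weights(F, target):
    """Least-squares calibration: perturb uniform weights on control
    units so the weighted feature mean equals `target` exactly."""
    base = np.full(F.shape[0], 1.0 / F.shape[0])
    adj = np.linalg.lstsq(F.T @ F, target - F.T @ base, rcond=None)[0]
    return base + F @ adj

def cross_balance_att(X, T, Y):
    idx = rng.permutation(len(Y))
    fold_a, fold_b = np.array_split(idx, 2)
    phi = lambda Z: np.column_stack([Z, Z ** 2])  # basis for the outcome model
    estimates = []
    for a, b in [(fold_a, fold_b), (fold_b, fold_a)]:
        # Fold A: learn an outcome feature m(x) on control units only.
        ctrl = T[a] == 0
        beta = np.linalg.lstsq(phi(X[a])[ctrl], Y[a][ctrl], rcond=None)[0]
        # Fold B: balance raw covariates plus the learned feature m(x).
        F = np.column_stack([X[b], phi(X[b]) @ beta])
        w = calibration_weights(F[T[b] == 0], F[T[b] == 1].mean(axis=0))
        estimates.append(Y[b][T[b] == 1].mean() - w @ Y[b][T[b] == 0])
    return float(np.mean(estimates))

# Toy data with a nonlinear control-outcome surface and a true ATT of 2.
n = 4000
X = rng.normal(size=(n, 2))
T = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))).astype(int)
Y = X[:, 0] + X[:, 1] ** 2 + 2.0 * T + rng.normal(size=n)
att = cross_balance_att(X, T, Y)
print(att)  # close to the true effect of 2
```

Because feature construction (fold A) and weight estimation (fold B) never see the same observations, the error of the learned feature does not contaminate the calibration step, which is the separation the abstract describes.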


A Unified Framework for Inference with General Missingness Patterns and Machine Learning Imputation

Chen, Xingran, McCormick, Tyler, Mukherjee, Bhramar, Wu, Zhenke

arXiv.org Machine Learning

Pre-trained machine learning (ML) predictions have been increasingly used to complement incomplete data to enable downstream scientific inquiries, but their naive integration risks biased inferences. Recently, multiple methods have been developed to provide valid inference with ML imputations regardless of prediction quality and to enhance efficiency relative to complete-case analyses. However, existing approaches are often limited to missing outcomes under a missing-completely-at-random (MCAR) assumption, failing to handle general missingness patterns under the more realistic missing-at-random (MAR) assumption. This paper develops a novel framework that delivers valid statistical inference for general Z-estimation problems using ML imputations under the MAR assumption and for general missingness patterns. The core technical idea is to stratify observations by distinct missingness patterns and construct an estimator by appropriately weighting and aggregating pattern-specific information through a masking-and-imputation procedure on the complete cases. We provide theoretical guarantees of asymptotic normality of the proposed estimator and efficiency dominance over weighted complete-case analyses. Practically, the method affords simple implementations by leveraging existing weighted complete-case analysis software. Extensive simulations are carried out to validate the theoretical results. The paper concludes with a brief discussion on practical implications, limitations, and potential future directions.
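For intuition, consider the simplest pattern, a missing outcome under MAR, and a mean-estimation target. The numpy sketch below is a simplified stand-in, not the paper's pattern-stratified Z-estimation machinery: it is the classical AIPW-style correction, where a deliberately imperfect pre-trained imputer fills in every outcome and the plug-in mean is then debiased with inverse-probability-weighted residuals on the complete cases. The imputer, the known observation probabilities, and the data-generating process are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)   # true mean of Y is 0
pi = 1 / (1 + np.exp(-X))          # MAR: observing Y depends only on observed X
R = rng.random(n) < pi             # R = 1 means Y is observed

f = 1.8 * X + 0.3                  # pre-trained, deliberately imperfect ML imputer

# Naive: fill in missing outcomes with the ML prediction and average (biased).
naive = np.where(R, Y, f).mean()

# Debiased: plug-in mean of imputations plus an IPW residual correction
# computed on the complete cases only.
debiased = f.mean() + np.mean(R * (Y - f) / pi)

print(naive, debiased)
```

The correction term has mean zero when either the imputer or the observation model is correct, which conveys (in miniature) why validity can hold regardless of prediction quality; the paper's contribution is extending this logic to general missingness patterns and Z-estimation problems.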




Self Balancing Neural Network: A Novel Method to Estimate Average Treatment Effect

Abdisa, Atomsa Gemechu, Zhou, Yingchun, Qiu, Yuqi

arXiv.org Machine Learning

In observational studies, confounding variables affect both the treatment and the outcome, and instrumental variables additionally influence the treatment assignment mechanism. This sets such studies apart from a standard randomized controlled trial, where treatment assignment is random, and it biases the estimated average treatment effect. A standard remedy is to incorporate the estimated propensity score when estimating the average treatment effect; however, such methods risk misspecification of the propensity score model. To solve this issue, a novel method called the "Self-balancing neural network" (Sbnet), which lets the model itself obtain its pseudo propensity score from a balancing net, is proposed in this study. The proposed method estimates the average treatment effect by using the balancing net as a key component of a feedforward neural network, resolving the estimation in a single step. Moreover, a multi-pseudo-propensity-score framework, in which scores estimated from diversified balancing nets are used to estimate the average treatment effect, is presented. Finally, the proposed methods are compared with state-of-the-art methods on three simulation setups and real-world datasets, where the self-balancing neural network outperforms the state-of-the-art methods.
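The Sbnet architecture itself is not reproduced here, but the role of a pseudo propensity score can be sketched with a minimal stand-in: a single logistic layer trained by gradient descent (a one-layer "balancing net") whose fitted scores feed a Hájek-weighted ATE estimate. The network depth, the simulation, and the learning-rate settings are illustrative assumptions, not the authors' design.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 2))
logits = 0.8 * X[:, 0] - 0.5 * X[:, 1]
T = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(float)
Y = X[:, 0] + X[:, 1] + 2.0 * T + rng.normal(size=n)   # true ATE = 2

# One-layer "balancing net": logistic scores fit by plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    e = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (e - T)) / n
    b -= 0.5 * (e - T).mean()

# Hájek (self-normalized) inverse-probability ATE using the pseudo scores.
e = 1 / (1 + np.exp(-(X @ w + b)))
mu1 = (T * Y / e).sum() / (T / e).sum()
mu0 = ((1 - T) * Y / (1 - e)).sum() / ((1 - T) / (1 - e)).sum()
ate = mu1 - mu0
print(ate)  # close to the true effect of 2
```

In Sbnet the analogous score comes from a learned balancing net embedded in the same feedforward network as the effect estimator, so both are obtained in one step rather than in the two stages shown here.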


High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

Weberpals, Janick, Shaw, Pamela A., Lin, Kueiyu Joshua, Wyss, Richard, Plasek, Joseph M, Zhou, Li, Ngan, Kerry, DeRamus, Thomas, Raman, Sudha R., Hammill, Bradley G., Lee, Hana, Toh, Sengwee, Connolly, John G., Dandreo, Kimberly J., Tian, Fang, Liu, Wei, Li, Jie, Hernández-Muñoz, José J., Schneeweiss, Sebastian, Desai, Rishi J.

arXiv.org Artificial Intelligence

Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as the outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets, and we benchmarked HDMI approaches against a baseline imputation and a complete-case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims data and sentence embeddings improved efficiency, yielding the lowest root-mean-squared error (0.173) and 94% coverage. NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
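The data-adaptive selection step can be illustrated with a small numpy sketch. The real study ran LASSO over claims- and NLP-derived features; here both the coordinate-descent solver and the toy auxiliary covariates are assumptions. A lasso regression of the partially observed confounder Z2 on candidate auxiliary covariates keeps only the predictive ones, which then drive the imputation model for the missing Z2 values.

```python
import numpy as np

rng = np.random.default_rng(3)

def lasso(X, y, lam, n_iter=200):
    """Plain coordinate-descent LASSO (soft-thresholding updates)."""
    beta = np.zeros(X.shape[1])
    col2 = (X ** 2).sum(axis=0)
    r = y - X @ beta
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r += X[:, j] * beta[j]                 # remove column j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col2[j]
            r -= X[:, j] * beta[j]
    return beta

# 20 candidate auxiliary covariates; only the first two actually predict Z2.
n, p = 500, 20
AC = rng.normal(size=(n, p))
Z2 = 1.5 * AC[:, 0] - 1.0 * AC[:, 1] + 0.5 * rng.normal(size=n)

beta = lasso(AC, Z2, lam=0.1 * n)
selected = np.flatnonzero(beta)   # indices of the retained auxiliary covariates
print(selected)

# Imputation model for Z2 uses only the selected covariates (OLS refit
# on the observed rows), mimicking 50% missingness in Z2.
obs = rng.random(n) < 0.5
coef = np.linalg.lstsq(AC[obs][:, selected], Z2[obs], rcond=None)[0]
Z2_imputed = AC[~obs][:, selected] @ coef
```

Refitting OLS on the selected set (rather than using the shrunken lasso coefficients directly) is a common choice that removes the lasso's shrinkage bias from the imputations; the study's pipeline additionally repeated selection for the missingness indicator MZ2 and for propensity-score covariates.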