Reliable Selection of Heterogeneous Treatment Effect Estimators

Guo, Jiayi, Gao, Zijun

arXiv.org Machine Learning 

The estimation of heterogeneous treatment effects (HTEs) has become a central topic across statistics, econometrics, and machine learning, with applications ranging from personalized medicine to policy evaluation [1, 2, 3]. A growing body of work has proposed flexible estimators to capture individual-level treatment heterogeneity, including tree-based methods [2], representation-learning approaches [4, 5], and meta-learners [6, 7]. Despite this abundance of methods, determining which estimator performs best for a given application remains an open and underexplored problem [8, 9]. A reliable selection mechanism is crucial for practitioners [10], as choosing a suboptimal estimator can directly affect downstream decision-making [11].

Evaluating or comparing HTE estimators is inherently difficult because the ground truth is unobservable: for each individual, only one potential outcome is realized [12], whereas the HTE is defined as the difference between the two potential outcomes. Because the treatment effect is never observed directly, comparing even two HTE estimators is challenging, and the difficulty grows when a whole collection of estimators is compared simultaneously. To our knowledge, most papers that compare multiple HTE estimators rely on ground-truth or simulated values and use them to compute metrics such as the Precision in Estimation of Heterogeneous Effect (PEHE) and the ATE [13, 14]. However, these evaluation metrics are subject to fundamental limitations: ground-truth values are unavailable in real-world observational studies, and simulated values depend critically on the chosen data-generating process and offer no formal statistical guarantees.

In this paper, we develop a method for accurately selecting the best heterogeneous treatment effect estimator that operates without ground-truth information and provides provable error control.
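To make the limitation of ground-truth-based metrics concrete, the following minimal sketch computes PEHE and an ATE error for two candidate estimators on simulated data. Everything in it is an illustrative assumption rather than part of the paper: the data-generating process `tau(x) = 1 + 2x`, the two hypothetical candidates, and the helper names `pehe` and `ate_error`. PEHE is taken here as the root mean squared error between estimated and true effects; some papers report the squared version instead.

```python
import math
import random


def pehe(tau_hat, tau_true):
    """Root mean squared error between estimated and true individual effects."""
    n = len(tau_true)
    return math.sqrt(sum((h - t) ** 2 for h, t in zip(tau_hat, tau_true)) / n)


def ate_error(tau_hat, tau_true):
    """Absolute error of the average treatment effect implied by the estimates."""
    n = len(tau_true)
    return abs(sum(tau_hat) / n - sum(tau_true) / n)


# Hypothetical simulated covariate and true effect tau(x) = 1 + 2x.
# In a real observational study tau_true is never available.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
tau_true = [1.0 + 2.0 * xi for xi in x]

# Two hypothetical candidate estimators (stand-ins for fitted models).
tau_hat_a = [1.0 + 1.8 * xi for xi in x]  # captures heterogeneity, slope slightly off
tau_hat_b = [1.0 for _ in x]              # ignores heterogeneity entirely

pehe_a = pehe(tau_hat_a, tau_true)
pehe_b = pehe(tau_hat_b, tau_true)

print("PEHE      A:", pehe_a, " B:", pehe_b)
print("ATE error A:", ate_error(tau_hat_a, tau_true),
      " B:", ate_error(tau_hat_b, tau_true))
```

Both metrics take `tau_true` as an input, so this ranking is only possible because the effects were simulated, and it would change under a different assumed data-generating process. The sketch also shows that a small ATE error need not imply a small PEHE: the constant estimator B can match the average effect closely while missing all heterogeneity.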