
 Shi, Chengchun


Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

arXiv.org Machine Learning

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play roles similar to those of classical value functions in fully observable MDPs. We derive a new Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions from this new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states and the Bellman completeness condition holds. Finally, we extend our method to learning the dynamics and establish the connection between our approach and the well-known spectral learning methods for POMDPs.
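As a schematic only (not the paper's exact formulation): let $\delta(V; O_t)$ denote a Bellman-type residual built from the reward, discount factor, and future proxies, so that the new Bellman equation reads $\mathbb{E}[\delta(V; O_t) \mid H_t] = 0$, with the history proxy $H_t$ acting as the instrument. A standard minimax estimator for such conditional moment restrictions, over a value class $\mathcal{V}$ and a critic class $\mathcal{F}$, is
\[
\widehat{V} \;=\; \arg\min_{V \in \mathcal{V}} \;\max_{f \in \mathcal{F}} \;
\frac{1}{n}\sum_{t=1}^{n} \Bigl[ \delta(V; O_t)\, f(H_t) \;-\; \tfrac{1}{2} f(H_t)^2 \Bigr],
\]
where the inner maximization screens for violations of the moment condition along every direction $f$ in the critic class; the residual $\delta$ and the function classes here are placeholders rather than the paper's definitions.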


Robust Offline Policy Evaluation and Optimization with Heavy-Tailed Rewards

arXiv.org Machine Learning

This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions.
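To make the central ingredient concrete, here is a minimal sketch of the median-of-means idea on which ROAM and ROOM build; the value-function estimation details are omitted, and the function below is illustrative rather than part of any released implementation. The spread of the block means also gives a simple measure of uncertainty, which is the kind of quantity the frameworks exploit for pessimism.

```python
import numpy as np

def median_of_means(x, n_blocks=10, seed=0):
    """Median-of-means estimate of E[x]: randomly split the sample into
    blocks, average within each block, and return the median of the block
    means. Far more robust to heavy-tailed data than the plain sample mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x[rng.permutation(len(x))], n_blocks)
    block_means = np.array([b.mean() for b in blocks])
    return np.median(block_means), block_means.std()  # estimate, rough spread

# Heavy-tailed rewards: Student-t with 2 degrees of freedom
rewards = np.random.default_rng(1).standard_t(df=2, size=10_000)
print(median_of_means(rewards))
```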


Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments

arXiv.org Artificial Intelligence

As AI becomes more prevalent throughout society, effective methods of integrating humans and AI systems that leverage their respective strengths and mitigate risk have become an important priority. In this paper, we introduce the paradigm of super reinforcement learning, which takes advantage of human-AI interaction for data-driven sequential decision making. This approach utilizes the observed action, whether from AI or humans, as input for achieving a stronger oracle in policy learning for the decision maker (humans or AI). In a decision process with unmeasured confounding, the actions taken by past agents can offer valuable insights into undisclosed information. By incorporating this information into the policy search in a novel and legitimate manner, the proposed super reinforcement learning yields a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior policy (i.e., past agents' actions). We call this stronger oracle a blessing from human-AI interaction. Furthermore, to address the issue of unmeasured confounding when finding super-policies from batch data, a number of nonparametric and causal identification results are established. Building on these novel identification results, we develop several super-policy learning algorithms and systematically study their theoretical properties, such as finite-sample regret guarantees. Finally, we illustrate the effectiveness of our proposal through extensive simulations and real-world applications.
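In the notation of this summary (not the paper's), the construction can be sketched as follows: a standard policy maps states to action distributions, while a super-policy additionally conditions on the observed behavior action $a^{b}$,
\[
\Pi = \{\pi : \mathcal{S} \to \Delta(\mathcal{A})\},
\qquad
\Pi^{\mathrm{super}} = \{\pi : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{A})\}.
\]
Since any standard policy (ignore $a^{b}$) and the behavior policy (copy $a^{b}$) both belong to $\Pi^{\mathrm{super}}$, the best super-policy is at least as good as either, which is the sense in which the observed action is a blessing; the identification results are what make this comparison attainable from confounded batch data.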


Off-policy Evaluation in Doubly Inhomogeneous Environments

arXiv.org Artificial Intelligence

Reinforcement learning (RL, Sutton and Barto, 2018) aims to optimize an agent's long-term reward by learning an optimal policy that determines the best action to take under every circumstance. RL is closely related to dynamic treatment regimes (DTRs) or adaptive treatment strategies in statistical research on precision medicine (Murphy, 2003; Robins, 2004; Qian and Murphy, 2011; Kosorok and Moodie, 2015; Tsiatis et al., 2019; Qi et al., 2020; Zhou et al., 2022a), which seek to obtain, in finite-horizon settings with a few treatment stages, the optimal treatment policy that maximizes patients' expected outcomes. Nevertheless, the statistical methods for DTRs mentioned above typically cannot handle large or infinite horizon settings: they require the number of trajectories to tend to infinity to achieve estimation consistency, unlike RL, which works even with a finite number of trajectories under certain conditions. In addition to precision medicine, RL has been applied to various fields, such as games (Silver et al., 2016), ridesharing (Xu et al., 2018), mobile health (Liao et al., 2021) and robotics (Levine et al., 2020).


A Reinforcement Learning Framework for Dynamic Mediation Analysis

arXiv.org Machine Learning

Mediation analysis learns the causal effect transmitted from treatments to outcomes via mediator variables, and has received increasing attention in various scientific domains as a tool for elucidating causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. We propose a reinforcement learning (RL) framework and are the first to evaluate dynamic mediation effects in infinite-horizon settings. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon identifying each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
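Schematically, and in notation introduced here rather than taken from the paper, the decomposition reads
\[
\mathrm{ATE} \;=\; \mathrm{IDE} + \mathrm{IME} + \mathrm{DDE} + \mathrm{DME},
\]
where IDE and IME denote the immediate direct and mediation effects on the current outcome, and DDE and DME denote the delayed direct and mediation effects carried to future outcomes through the system dynamics.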


Testing for the Markov Property in Time Series via Deep Conditional Generative Learning

arXiv.org Artificial Intelligence

The Markov property is widely imposed in the analysis of time series data. Correspondingly, testing the Markov property, and relatedly, inferring the order of a Markov model, are of paramount importance. In this article, we propose a nonparametric test for the Markov property in high-dimensional time series via deep conditional generative learning. We also apply the test sequentially to determine the order of the Markov model. We show that the test controls the type-I error asymptotically and has power approaching one. Our proposal makes novel contributions in several ways. We utilize and extend state-of-the-art deep generative learning to estimate the conditional density functions, and establish a sharp upper bound on the approximation error of the estimators. We derive a doubly robust test statistic, which employs nonparametric estimation but achieves a parametric convergence rate. We further adopt sample splitting and cross-fitting to minimize the conditions required to ensure the consistency of the test. We demonstrate the efficacy of the test through both simulations and three data applications.
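The sequential use of the test to infer the order can be sketched as follows; `markov_test` below is a placeholder for the paper's deep-generative conditional-independence test and is simply assumed to return a p-value.

```python
def select_markov_order(series, max_order=10, alpha=0.05, markov_test=None):
    """Sequentially test H0: the series is Markov of order k, for k = 1, 2, ...
    Return the smallest order that is not rejected at level alpha."""
    for k in range(1, max_order + 1):
        p_value = markov_test(series, order=k)
        if p_value > alpha:        # fail to reject: order k looks adequate
            return k
    return max_order               # fall back if every order is rejected
```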


Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing

arXiv.org Artificial Intelligence

Many modern tech companies, such as Google, Uber, and Didi, utilize online experiments (also known as A/B testing) to evaluate new policies against existing ones. While most studies concentrate on average treatment effects, situations with skewed and heavy-tailed outcome distributions may benefit from alternative criteria, such as quantiles. However, assessing dynamic quantile treatment effects (QTE) remains a challenge, particularly when dealing with data from ride-sourcing platforms that involve sequential decision-making across time and space. In this paper, we establish a formal framework to calculate QTE conditional on characteristics independent of the treatment. Under specific model assumptions, we demonstrate that the dynamic conditional QTE (CQTE) equals the sum of individual CQTEs across time, even though the conditional quantile of cumulative rewards may not necessarily equate to the sum of conditional quantiles of individual rewards. This crucial insight significantly streamlines the estimation and inference processes for our target causal estimand. We then introduce two varying coefficient decision process (VCDP) models and devise an innovative method to test the dynamic CQTE. Moreover, we expand our approach to accommodate data from spatiotemporal dependent experiments and examine both conditional quantile direct and indirect effects. To showcase the practical utility of our method, we apply it to three real-world datasets from a ride-sourcing platform. Theoretical findings and comprehensive simulation studies further substantiate our proposal.
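In symbols (introduced here for illustration), with $Q_\tau(\cdot \mid x)$ the conditional $\tau$-th quantile and $R_t$ the reward at time $t$, the key insight is that under the stated model assumptions
\[
\mathrm{CQTE}_{\tau}(x) \;=\; \sum_{t=1}^{T} \mathrm{CQTE}_{t,\tau}(x),
\qquad\text{even though in general}\qquad
Q_\tau\!\Bigl(\textstyle\sum_{t=1}^{T} R_t \,\Big|\, x\Bigr) \;\neq\; \sum_{t=1}^{T} Q_\tau(R_t \mid x).
\]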


A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets

arXiv.org Artificial Intelligence

This paper concerns applications in two-sided markets that involve a group of subjects making sequential decisions across time and/or location. In particular, we consider large-scale fleet management in ride-sharing companies, such as Uber, Lyft and Didi. These companies form a typical two-sided market that enables efficient interactions between passengers and drivers (Armstrong, 2006; Rysman, 2009). With the rapid development of smartphones and the internet of things, these platforms have substantially transformed the transportation landscape (Frenken and Schor, 2017; Jin et al., 2018; Hagiu and Wright, 2019). With rich information on passenger demand and the locations of taxi drivers, they significantly reduce taxi cruise time and passenger waiting time in comparison to traditional taxi systems (Li et al., 2011; Zhang et al., 2014; Miao et al., 2016). We use the numbers of drivers and call orders to measure the supply and demand at a given time and location. Both supply and demand are spatio-temporal processes, and they interact with each other. These processes depend strongly on the platform's policies and have a huge impact on the platform's outcomes of interest, such as drivers' income level and working time, passengers' satisfaction rate, the order answering rate and the order finishing rate.


Sequential Knockoffs for Variable Selection in Reinforcement Learning

arXiv.org Artificial Intelligence

Interest in reinforcement learning (RL, Sutton & Barto 2018) has increased dramatically in recent years due in part to a number of high-profile successes in games (Mnih et al. 2013, 2015), autonomous driving (Sallab et al. 2017), and precision medicine (Tsiatis et al. 2019). However, despite theoretical and computational advances, real-world application of RL remains difficult. A primary challenge is dealing with high-dimensional state representations. Such representations occur naturally in systems with high-dimensional measurements, like images or audio, but can also occur when the system state is constructed by concatenating a series of measurements over a contiguous block of time. A high-dimensional state, when a more parsimonious one would suffice, dilutes the efficiency of learning algorithms and makes the estimated optimal policy harder to interpret. Thus, methods for removing uninformative or redundant variables from the state are of tremendous practical value. We develop a general variable selection algorithm for offline RL, which aims to learn an optimal policy using only logged data, i.e., without any additional online interaction. Our contributions can be summarized as follows: (i) we formally define a minimal sufficient state for an MDP and argue that it is an appropriate target by which to design and evaluate variable selection methods in RL; (ii) we show that naïve variable selection methods based on the state or reward alone need not recover the minimal sufficient state; (iii) we propose a novel sequential knockoffs (SEEK) algorithm that applies with general black-box learning methods and, under a β-mixing condition, consistently recovers the minimal sufficient state and controls the false discovery rate (FDR, the ratio of the number of selected irrelevant variables to the number of selected variables); and (iv) we develop a novel algorithm to estimate the β-mixing coefficients of an MDP. The algorithm in (iv) is important in its own right, as it applies to a number of applications beyond RL (McDonald et al. 2015).
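For reference, in the usual knockoffs convention the false discovery rate controlled by SEEK can be written as
\[
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{\#\{\text{selected variables that are irrelevant}\}}{\#\{\text{selected variables}\} \vee 1}\right],
\]
the expected proportion of selected variables that are not part of the minimal sufficient state.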


Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach

arXiv.org Artificial Intelligence

In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging the recommendation of actions that are less explored conditional on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of these methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning to optimize the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, so that no additional tuning of the degree of pessimism is required. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis models to Bayesian neural network models. We develop a computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method and show empirically that it outperforms existing state-of-the-art solutions through both simulations and a real data example.
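A minimal sketch of how a credible lower bound can drive action selection is given below; this is illustrative only, and `posterior_q_samples` is a hypothetical stand-in for the paper's variational posterior and credible-set construction.

```python
import numpy as np

def pessimistic_greedy_action(state, actions, posterior_q_samples, level=0.05):
    """Pick the action maximizing a lower credible bound on Q(state, action).
    Acting on the lower bound, rather than the posterior mean, discourages
    poorly explored actions, which is the pessimism principle."""
    lower_bounds = [
        np.quantile(np.asarray(posterior_q_samples(state, a)), level)
        for a in actions
    ]
    return actions[int(np.argmax(lower_bounds))]
```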