Wan, Runzhe
A Review of Causal Decision Making
Ge, Lin, Cai, Hengrui, Wan, Runzhe, Xu, Yang, Song, Rui
To make effective decisions, it is important to have a thorough understanding of the causal relationships among actions, environments, and outcomes. This review aims to surface three crucial aspects of decision-making through a causal lens: 1) the discovery of causal relationships through causal structure learning, 2) understanding the impacts of these relationships through causal effect learning, and 3) applying the knowledge gained from the first two aspects to support decision-making via causal policy learning. Moreover, we identify challenges that hinder the broader utilization of causal decision-making and discuss recent advances in overcoming them. Finally, we outline future research directions for addressing these challenges and further advancing the practical implementation of causal decision-making, and we illustrate the proposed framework with real-world applications. We aim to offer a comprehensive methodology and practical implementation framework by consolidating various methods in this area into a Python-based collection. URL: https://causaldm.github.io/Causal-Decision-Making.
A Review of Reinforcement Learning in Financial Applications
Bai, Yahui, Gao, Yuhe, Wan, Runzhe, Zhang, Sheng, Song, Rui
A financial market is a marketplace where financial instruments such as stocks and bonds are bought and sold (Fama 1970). Individuals and organizations can play crucial roles in financial markets to facilitate the allocation of capital. Market participants face diverse challenges, such as portfolio management, which aims to maximize investment returns over time, and market-making, which seeks to profit from the bid-ask spread while managing inventory risk. As the volume of financial data has increased dramatically over time, new opportunities and challenges have arisen in the analysis process, leading to the increased adoption of advanced Machine Learning (ML) models. Reinforcement Learning (RL) (Sutton & Barto 2018), one of the main categories of ML, has revolutionized the field of artificial intelligence by enabling agents to interact with an environment and to learn and improve their performance from that interaction. The success of RL has been demonstrated in various fields, including games, robotics, and mobile health (Nash Jr 1950, Kalman 1960, Murphy 2003). In finance, applications such as market making, portfolio management, and order execution can benefit from the ability of RL algorithms to learn and adapt to changing environments. Compared to traditional models that rely on statistical techniques and econometric methods such as time series models (ARMA, ARIMA), factor models, and panel models, the RL framework empowers agents to learn decision-making by interacting with an environment and deducing the consequences of past actions to maximize cumulative rewards (Charpentier et al. 2021).
Zero-Inflated Bandits
Wei, Haoyu, Wan, Runzhe, Shi, Lei, Song, Rui
Many real applications of bandits have sparse non-zero rewards, leading to slow learning rates. Careful distribution modeling that exploits problem-specific structure is known to be critical to estimation efficiency in the statistics literature, yet it is under-explored in bandits. To fill this gap, we initiate the study of zero-inflated bandits, where the reward is modeled by a classic semi-parametric family, the zero-inflated distribution. We carefully design Upper Confidence Bound (UCB) and Thompson Sampling (TS) algorithms for this specific structure. Our algorithms are suitable for a very general class of reward distributions, operating under tail assumptions that are considerably less stringent than the typical sub-Gaussian requirements. Theoretically, we derive regret bounds for both the UCB and TS algorithms in the multi-armed bandit setting, showing that they achieve rate-optimal regret when the reward distribution is sub-Gaussian. The superior empirical performance of the proposed methods is shown via extensive numerical studies.
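To make the zero-inflated structure concrete, the sketch below is a minimal, illustrative Thompson Sampling variant for zero-inflated Gaussian rewards, not the paper's exact algorithm: it keeps a Beta posterior on each arm's nonzero-reward probability and a conjugate Normal posterior (unit noise variance assumed) on the nonzero-reward mean; all arm parameters are made up for the toy example.

import numpy as np

rng = np.random.default_rng(0)

# Toy zero-inflated Gaussian bandit (all parameters assumed for illustration).
p_true = np.array([0.10, 0.20, 0.15])   # probability that the reward is nonzero
mu_true = np.array([1.0, 0.8, 1.2])     # mean of the nonzero reward component
K = len(p_true)

# Posteriors: Beta(a, b) on each arm's nonzero probability, and a conjugate
# Normal(m, 1/tau) on its nonzero-reward mean (unit noise variance assumed).
a, b = np.ones(K), np.ones(K)
m, tau = np.zeros(K), np.ones(K)

def pull(k):
    nonzero = rng.random() < p_true[k]
    return (mu_true[k] + rng.normal()) if nonzero else 0.0

for t in range(5000):
    # Thompson Sampling on the zero-inflated mean p * mu.
    p_s = rng.beta(a, b)
    mu_s = m + rng.normal(size=K) / np.sqrt(tau)
    k = int(np.argmax(p_s * mu_s))

    r = pull(k)
    if r != 0.0:
        a[k] += 1
        m[k] = (tau[k] * m[k] + r) / (tau[k] + 1)   # conjugate Normal update
        tau[k] += 1
    else:
        b[k] += 1

print("estimated mean reward per arm:", a / (a + b) * m)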
Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches
Liu, Yu, Wan, Runzhe, McQueen, James, Hains, Doug, Gu, Jinxiang, Song, Rui
The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters choose the AES based on domain knowledge. However, this approach becomes impractical for online experimentation services managing numerous experiments, so a more automated approach is in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model that accounts for heteroskedasticity across experiments and seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods on both simulated and real data, we showcase the superior performance of the proposed approaches.
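As a rough illustration of the utility-theoretic idea only (not the paper's actual utility function), the sketch below sizes a two-sample z-test against each candidate AES, draws the true effect from a hypothetical prior, and picks the AES that best trades off expected detection benefit against a linear cost per day of experiment duration; the noise scale, traffic, prior, and cost constants are all assumed.

import numpy as np
from scipy.stats import norm

def per_arm_n(aes, sigma, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sample z-test sized against `aes`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * sigma / aes) ** 2

def power_at(true_effect, n, sigma, alpha=0.05):
    """Power of the two-sample z-test against a given true effect, with n per arm."""
    se = sigma * np.sqrt(2.0 / n)
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - true_effect / se)

def utility(aes, sigma, daily_traffic, effect_prior, value=1.0, cost_per_day=0.005):
    n = per_arm_n(aes, sigma)
    days = n / daily_traffic
    expected_benefit = value * power_at(effect_prior, n, sigma).mean()
    return expected_benefit - cost_per_day * days   # benefit minus duration cost

rng = np.random.default_rng(1)
sigma, traffic = 1.0, 2000.0
effect_prior = np.abs(rng.normal(0.0, 0.05, size=10000))  # hypothetical prior draws of the true effect

candidates = np.linspace(0.01, 0.10, 19)
best = max(candidates, key=lambda e: utility(e, sigma, traffic, effect_prior))
print(f"chosen AES: {best:.3f}, implied duration: {per_arm_n(best, sigma) / traffic:.1f} days")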
Robust Offline Policy Evaluation and Optimization with Heavy-Tailed Rewards
Zhu, Jin, Wan, Runzhe, Qi, Zhengling, Luo, Shikai, Shi, Chengchun
This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly handles heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions.
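To illustrate the core median-of-means idea in the simplest (bandit-style, one-step) case rather than the full ROAM/ROOM estimators, the sketch below splits logged data into blocks, computes an importance-sampling value estimate on each block, and returns the median, optionally shifted down by a spread-based pessimism term; the logged data, behavior policy, and target policy are all synthetic assumptions.

import numpy as np

def mom_ope(rewards, behavior_probs, target_probs, n_blocks=10, pessimism=0.0):
    """Median-of-means importance-sampling estimate of a target policy's value.

    A simplified illustration: split the logged data into blocks, form the IS
    estimate on each block, and return the median, optionally lowered by a
    spread-based pessimism term.
    """
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(rewards))
    blocks = np.array_split(idx, n_blocks)
    w = target_probs / behavior_probs                    # importance weights
    block_means = np.array([np.mean(w[b] * rewards[b]) for b in blocks])
    spread = np.median(np.abs(block_means - np.median(block_means)))
    return np.median(block_means) - pessimism * spread

# Toy logged data with heavy-tailed rewards (Student-t noise, assumed for illustration).
rng = np.random.default_rng(42)
n = 20000
actions = rng.integers(0, 2, size=n)
rewards = actions * 1.0 + rng.standard_t(df=2.1, size=n)   # heavy tails
behavior_probs = np.full(n, 0.5)
target_probs = np.where(actions == 1, 0.9, 0.1)            # target policy prefers action 1

print("plain IS estimate   :", np.mean(target_probs / behavior_probs * rewards))
print("median-of-means (pess.):", mom_ope(rewards, behavior_probs, target_probs, pessimism=1.0))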
STEEL: Singularity-aware Reinforcement Learning
Chen, Xiaohong, Qi, Zhengling, Wan, Runzhe
Batch reinforcement learning (RL) aims to leverage pre-collected data to find an optimal policy that maximizes the expected total rewards in a dynamic environment. Nearly all existing algorithms rely on the assumption that the distribution induced by the target policies is absolutely continuous with respect to the data distribution, so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice (e.g., non-overlapping support), especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm that does not require absolute continuity, in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis of off-policy evaluation, where we use the maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, by requiring only a minimal data-coverage assumption, STEEL significantly improves the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method in dealing with possible singularity in batch RL.
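The maximum mean discrepancy term used to quantify the mismatch between target-induced and batch state-action distributions can be illustrated with a standard RBF-kernel MMD estimate; the sketch below is a generic (biased) MMD computation on synthetic state-action samples, not STEEL itself, and the sample shapes and bandwidth are assumed.

import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    # Pairwise squared Euclidean distances, then a Gaussian kernel.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy between samples X and Y."""
    return (rbf_kernel(X, X, bandwidth).mean()
            + rbf_kernel(Y, Y, bandwidth).mean()
            - 2 * rbf_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
batch_sa  = rng.normal(0.0, 1.0, size=(500, 3))   # state-action pairs from the batch data
target_sa = rng.normal(0.7, 1.0, size=(500, 3))   # pairs visited under a hypothetical target policy
print("MMD^2 between target-induced and batch distributions:", mmd2(batch_sa, target_sa))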
Experimentation Platforms Meet Reinforcement Learning: Bayesian Sequential Decision-Making for Continuous Monitoring
Wan, Runzhe, Liu, Yu, McQueen, James, Hains, Doug, Song, Rui
With the growing need for online A/B testing to support innovation in industry, the opportunity cost of running an experiment becomes non-negligible. Therefore, there is an increasing demand for an efficient continuous monitoring service that allows early stopping when appropriate. Classic statistical methods focus on hypothesis testing and are mostly developed for traditional high-stakes problems such as clinical trials, while experiments at online service companies typically have very different features and focuses. Motivated by these real needs, in this paper we introduce a novel framework that we developed at Amazon to maximize customer experience and control opportunity cost. We formulate the problem as a Bayesian optimal sequential decision-making problem with a unified utility function. We discuss practical design choices and considerations extensively. We further show how to solve for the optimal decision rule via Reinforcement Learning and how to scale the solution. We demonstrate the effectiveness of this novel approach compared with existing methods via a large-scale meta-analysis of experiments at Amazon.
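As a stylized illustration of the Bayesian sequential decision-making formulation (not the production utility function or the RL solution described in the paper), the sketch below updates a conjugate Normal posterior on the treatment effect each day and stops when the utility of stopping now exceeds a one-step-lookahead estimate of the value of continuing minus a per-day opportunity cost; all numerical settings are assumed.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Stylized setup (all numbers assumed): a daily treatment-vs-control difference with known
# noise scale, a Normal prior on the true effect, and a fixed opportunity cost per extra day.
true_effect, noise_sd, daily_n = 0.02, 1.0, 5000
post_mean, post_var = 0.0, 0.05**2           # prior on the effect
cost_per_day = 1e-4
obs_var = noise_sd**2 / daily_n              # variance of one day's observed difference

for day in range(1, 61):
    x = true_effect + rng.normal(0.0, np.sqrt(obs_var))
    # Conjugate Normal update of the posterior on the effect.
    new_var = 1.0 / (1.0 / post_var + 1.0 / obs_var)
    post_mean = new_var * (post_mean / post_var + x / obs_var)
    post_var = new_var

    # Utility of stopping now: launch if the posterior mean effect is positive, else keep control.
    u_stop = max(post_mean, 0.0)
    # One-step-lookahead stand-in for the continuation value: expected utility of stopping
    # after one more day of data, minus the daily opportunity cost.
    s = post_var / np.sqrt(post_var + obs_var)       # sd of tomorrow's posterior mean
    exp_next = post_mean * norm.cdf(post_mean / s) + s * norm.pdf(post_mean / s)
    if u_stop >= exp_next - cost_per_day:
        action = "launch" if post_mean > 0 else "keep control"
        print(f"day {day}: stop and {action} (posterior mean {post_mean:.4f})")
        break
else:
    print("reached the maximum horizon without stopping")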
A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets
Shi, Chengchun, Wan, Runzhe, Song, Ge, Luo, Shikai, Song, Rui, Zhu, Hongtu
This paper concerns applications in two-sided markets that involve a group of subjects making sequential decisions across time and/or location. In particular, we consider large-scale fleet management in ride-sharing companies such as Uber, Lyft, and Didi. These companies form a typical two-sided market that enables efficient interactions between passengers and drivers (Armstrong, 2006; Rysman, 2009). With the rapid development of smartphones and the Internet of Things, these platforms have substantially transformed the transportation landscape (Frenken and Schor, 2017; Jin et al., 2018; Hagiu and Wright, 2019). With rich information on passenger demand and the locations of taxi drivers, they significantly reduce taxi cruise time and passenger waiting time in comparison with traditional taxi systems (Li et al., 2011; Zhang et al., 2014; Miao et al., 2016). We use the numbers of drivers and call orders to measure the supply and demand at a given time and location. Both supply and demand are spatio-temporal processes, and they interact with each other. These processes depend strongly on the platform's policies and have a substantial impact on the platform's outcomes of interest, such as drivers' income level and working time, passengers' satisfaction rate, order answering rate, and order finishing rate.
Multiplier Bootstrap-based Exploration
Wan, Runzhe, Wei, Haoyu, Kveton, Branislav, Song, Rui
Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. In this paper, we propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real data experiments, we show the generality and adaptivity of MBE.
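A minimal sketch of the multiplier-bootstrap idea in a Gaussian multi-armed bandit (omitting the additional perturbations the paper introduces to guarantee sufficient exploration): at each round, each arm's observed rewards are re-weighted by random Exp(1) multipliers and the arm with the largest perturbed weighted mean is pulled; the arm means and horizon below are assumed for illustration.

import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([0.2, 0.5, 0.4])          # assumed arm means for illustration
K, T = len(mu_true), 3000
obs = [[] for _ in range(K)]                  # rewards observed for each arm

def perturbed_mean(rewards):
    """Weighted mean with random multiplier weights (here Exp(1)), i.e., one bootstrap draw."""
    r = np.asarray(rewards)
    w = rng.exponential(1.0, size=len(r))
    return np.sum(w * r) / np.sum(w)

for t in range(T):
    if t < K:                                 # pull each arm once to initialize
        k = t
    else:
        # Multiplier-bootstrap exploration: pick the arm with the largest perturbed mean.
        k = int(np.argmax([perturbed_mean(obs[j]) for j in range(K)]))
    obs[k].append(mu_true[k] + rng.normal(0.0, 1.0))

print("pull counts per arm:", [len(o) for o in obs])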
Heterogeneous Synthetic Learner for Panel Data
Shen, Ye, Wan, Runzhe, Cai, Hengrui, Song, Rui
Evaluating the treatment effect from panel data has become an increasingly important problem in numerous areas, including public health (Cole et al. 2020, Goodman-Bacon & Marcus 2020), politics (Abadie et al. 2010, Sabia et al. 2012), and economics (Cavallo et al. 2013, Dube & Zipperer 2015). During the past decades, a number of methods have been developed to estimate the average treatment effect (ATE) from panel data, including the celebrated Difference-in-Differences (DiD) (Abadie 2005) and the Synthetic Control (SC) method (Abadie & Gardeazabal 2003, Abadie et al. 2010). Yet, due to the heterogeneity of individuals in response to treatments, there may not exist a single treatment that is uniformly optimal across individuals. Thus, one major focus in causal machine learning is to assess the Heterogeneous Treatment Effect (HTE) (see, e.g., Athey & Imbens 2015, Shalit et al. 2017, Wager & Athey 2018, Künzel et al. 2019, Farrell et al. 2021), which measures the causal impact within a given group. Detecting such heterogeneity in panel data has hence become an inevitable trend in the new era of personalization. However, estimating the HTE in panel data is surprisingly underexplored in the literature. On the one hand, despite the many methods available for HTE estimation (see, e.g., Athey & Imbens 2016, Johnson et al. 2019, Künzel et al. 2019, Nie & Wager 2021, and the references therein), most of these works focus on independently and identically distributed (i.i.d.) observations and thus cannot handle the non-stationarity and temporal dependency common in the panel data setting. On the other hand, in contrast to the popularity of estimating the ATE in panel data as mentioned above, limited progress has been made on HTE estimation.