Sequential Knockoffs for Variable Selection in Reinforcement Learning

Ma, Tao, Cai, Hengrui, Qi, Zhengling, Shi, Chengchun, Laber, Eric B.

arXiv.org Artificial Intelligence 

Interest in reinforcement learning (RL, Sutton & Barto 2018) has increased dramatically in recent years due in part to a number of high-profile successes in games (Mnih et al. 2013, 2015), autonomous driving (Sallab et al. 2017), and precision medicine (Tsiatis et al. 2019). However, despite theoretical and computational advances, real-world application of RL remains difficult. A primary challenge is dealing with high-dimensional state representations. Such representations occur naturally in systems with high-dimensional measurements, like images or audio, but can also occur when the system state is constructed by concatenating a series of measurements over a contiguous block of time. A high-dimensional state, when a more parsimonious one would suffice, dilutes the efficiency of learning algorithms and makes the estimated optimal policy harder to interpret. Thus, methods for removing uninformative or redundant variables from the state are of tremendous practical value. We develop a general variable selection algorithm for offline RL, which aims to learn an optimal policy using only logged data, i.e., without any additional online interaction. Our contributions can be summarized as follows: (i) we formally define a minimal sufficient state for an MDP and argue that it is an appropriate target by which to design and evaluate variable selection methods in RL; (ii) we show that naïve variable selection methods based on the state or reward alone need not recover the minimal sufficient state; (iii) we propose a novel sequential knockoffs (SEEK) algorithm that applies with general black-box learning methods, and, under a β-mixing condition, consistently recovers the minimal sufficient state and controls the false discovery rate (FDR, the ratio of the number of selected irrelevant variables to the number of selected variables); and (iv) we develop a novel algorithm to estimate the β-mixing coefficients of an MDP. The algorithm in (iv) is important in its own right as it applies to a number of applications beyond RL (McDonald et al. 2015).
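
To make the FDR-control mechanism concrete, the sketch below shows one generic knockoff-filter step (in the style of Barber & Candès / Candès et al.): sample second-order Gaussian knockoff copies of the candidate variables, compute lasso-based importance statistics, and keep only variables whose statistics clear the knockoff+ threshold at a target FDR level q. This is not the authors' SEEK algorithm (which applies such a step sequentially under a β-mixing condition); the Gaussian knockoff construction, the lasso importance statistic, and all function and variable names here are illustrative assumptions.

```python
# Minimal sketch of a single knockoff-filter selection step with FDR control.
# NOT the SEEK algorithm itself; sampler, statistic, and names are assumptions.
import numpy as np
from sklearn.linear_model import LassoCV


def gaussian_knockoffs(X, rng):
    """Second-order (equicorrelated) Gaussian knockoff copies of X."""
    n, p = X.shape
    Sigma = np.corrcoef(X, rowvar=False)
    Sigma_inv = np.linalg.inv(Sigma)
    # Equicorrelated construction: s_j = min(2 * lambda_min(Sigma), 1),
    # shrunk slightly so the conditional covariance stays positive definite.
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, min(2.0 * lam_min, 1.0) * 0.999)
    D = np.diag(s)
    mu = X - X @ Sigma_inv @ D           # conditional mean of the knockoffs
    V = 2.0 * D - D @ Sigma_inv @ D      # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))
    return mu + rng.standard_normal((n, p)) @ L.T


def knockoff_select(X, y, q=0.1, seed=0):
    """Indices of variables selected at target FDR level q (knockoff+ rule)."""
    rng = np.random.default_rng(seed)
    X = (X - X.mean(0)) / X.std(0)
    X_ko = gaussian_knockoffs(X, rng)
    beta = LassoCV(cv=5).fit(np.hstack([X, X_ko]), y).coef_
    p = X.shape[1]
    # Importance statistic: original coefficient magnitude minus knockoff's.
    W = np.abs(beta[:p]) - np.abs(beta[p:])
    # knockoff+ threshold: smallest t with estimated FDP <= q.
    tau = np.inf
    for t in np.sort(np.abs(W[W != 0])):
        fdp = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp <= q:
            tau = t
            break
    return np.where(W >= tau)[0]
```

For example, `knockoff_select(X, y, q=0.1)` would return the indices of variables retained at a nominal FDR of 0.1; SEEK's contribution is to embed this kind of filter in a sequential procedure over the MDP's state and reward, rather than applying it once to a single regression.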
