On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, Csaba Szepesvári

arXiv.org Artificial Intelligence 

Batch reinforcement learning (RL) broadly refers to the problem of finding a policy with high expected return in a stochastic control problem when only a batch of data collected from the controlled system is available. Here, we consider this problem for finite state-action ("tabular") Markov decision processes (MDPs) when the data takes the form of trajectories obtained by following some logging policy. In more detail, the trajectories are composed of sequences of states, actions, and rewards, where each action is chosen by the logging policy and the next state and reward follow the distributions specified by the MDP's parameters. Arguably, this is the most natural problem setting to consider in batch learning. The basic questions are (a) what logging policy should one choose to generate the data so as to maximize the chance of obtaining a good policy with as little data as possible; and (b) how many transitions are necessary and sufficient to obtain a good policy, and which algorithm should be used to obtain it? Note that here the logging policy needs to be chosen a priori, that is, before the data collection process begins. An alternative way of saying this is that the data collection is done in a passive way. Our main results are as follows: First, we show that (perhaps unsurprisingly), in the absence of extra information about the nature of the MDP, the best logging policy chooses actions uniformly at random. Next, we show that the number of transitions necessary and sufficient to obtain a good policy, that is, the sample complexity of learning, is an exponential function of the minimum of the number of states and the planning horizon.
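To make the data-generation process described above concrete, the following is a minimal sketch, in Python/NumPy, of collecting a batch of trajectories from a tabular MDP by following a fixed logging policy. The MDP parameters (`P`, `R`), the Gaussian reward noise, the uniform initial-state distribution, the horizon `H`, and all names are illustrative assumptions for this sketch, not the paper's actual construction or experimental setup; the uniform-random policy `mu_uniform` corresponds to the logging policy the abstract identifies as the best a-priori choice.

```python
import numpy as np

def collect_batch(P, R, mu, H, num_trajectories, rng):
    """Collect trajectories by following a fixed (logging) policy mu.

    P   : (S, A, S) array of transition probabilities P[s, a, s'].
    R   : (S, A) array of mean rewards; rewards are sampled with unit-variance
          Gaussian noise purely for illustration (an assumed noise model).
    mu  : (S, A) array; mu[s] is the logging policy's action distribution in state s.
    H   : planning horizon (trajectory length).
    """
    S, A = R.shape
    batch = []
    for _ in range(num_trajectories):
        s = rng.integers(S)                      # illustrative: uniform initial-state distribution
        traj = []
        for _ in range(H):
            a = rng.choice(A, p=mu[s])           # action chosen by the logging policy
            r = R[s, a] + rng.normal()           # reward drawn from the MDP (assumed noise model)
            s_next = rng.choice(S, p=P[s, a])    # next state from the transition kernel
            traj.append((s, a, r, s_next))
            s = s_next
        batch.append(traj)
    return batch

# Uniform-random logging policy: the a-priori choice highlighted in the abstract
# when nothing else is known about the MDP.
rng = np.random.default_rng(0)
S, A, H = 5, 3, 10
P = rng.dirichlet(np.ones(S), size=(S, A))       # random transition kernel (illustration only)
R = rng.uniform(size=(S, A))                     # random mean rewards (illustration only)
mu_uniform = np.full((S, A), 1.0 / A)
data = collect_batch(P, R, mu_uniform, H, num_trajectories=100, rng=rng)
```

The batch `data` is then all a learner has to work with: since the logging policy is fixed before collection, no adaptive exploration is possible, which is exactly the passive data-collection regime whose sample complexity the paper analyzes.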
