Finding good policies in average-reward Markov Decision Processes without prior knowledge

Neural Information Processing Systems 

We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H \leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires the knowledge of $H$.
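For reference, a sketch of the standard definitions of the two complexity measures from the average-reward literature (the abstract does not restate them; these are the usual definitions under a communicating-MDP assumption, not a quotation from the paper):

\[
  % Diameter: worst-case expected travel time between any two states,
  % minimized over stationary policies for each ordered pair (s, s').
  D \;=\; \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}^{\pi}\!\left[\tau(s \to s')\right]
\]
\[
  % Optimal bias span: the span seminorm of the optimal bias function h^*,
  % which, together with the optimal gain g^*, solves the average-reward
  % Bellman optimality equation.
  g^* + h^*(s) \;=\; \max_{a \in \mathcal{A}} \Big( r(s,a) + \sum_{s'} p(s' \mid s, a)\, h^*(s') \Big),
  \qquad
  H \;=\; \operatorname{sp}(h^*) \;=\; \max_{s} h^*(s) - \min_{s} h^*(s).
\]

Since the bias span only measures variation in long-run value across states while the diameter bounds travel times under an arbitrary reward structure, $H \leq D$ holds, and the gap can be large.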