Minimax Optimal Reinforcement Learning for Discounted MDPs

He, Jiafan, Zhou, Dongruo, Gu, Quanquan

arXiv.org Machine Learning 

The goal of reinforcement learning is to design algorithms that learn the optimal policy through interaction with an unknown dynamic environment. Markov decision processes (MDPs) play a central role in reinforcement learning because they can describe time-independent state transitions. In particular, the discounted MDP is one of the standard MDPs in reinforcement learning and is used to describe sequential tasks without interruption or restart. Various reinforcement learning algorithms have been proposed for discounted MDPs. For example, Azar et al. (2013) proposed an Empirical QVI algorithm that achieves the optimal sample complexity for finding the optimal value function. Sidford et al. (2018a) proposed a sublinear randomized value iteration algorithm that achieves a near-optimal sample complexity for finding the optimal policy, and Sidford et al. (2018b) further improved it to reach the optimal sample complexity.
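
For intuition only, the model-based idea behind sample-based Q-value iteration on a discounted MDP can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Azar et al. (2013) or of the present paper: it assumes access to a sampler for next states, and the names `empirical_qvi`, `sample_next_state`, `toy_step`, and `toy_reward`, as well as all parameter values, are hypothetical.

```python
import random


def empirical_qvi(sample_next_state, reward, n_states, n_actions,
                  gamma=0.9, n_samples=200, n_iters=200, seed=0):
    """Estimate optimal Q-values from sampled transitions (illustrative sketch)."""
    rng = random.Random(seed)

    # Step 1: estimate transition probabilities by sampling every (s, a) pair.
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                counts[s][a][sample_next_state(s, a, rng)] += 1
    p_hat = [[[c / n_samples for c in counts[s][a]] for a in range(n_actions)]
             for s in range(n_states)]

    # Step 2: run Q-value iteration on the empirical model.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [max(row) for row in q]  # greedy state values under the current Q
        q = [[reward(s, a)
              + gamma * sum(p * v_next for p, v_next in zip(p_hat[s][a], v))
              for a in range(n_actions)]
             for s in range(n_states)]
    return q


# Hypothetical two-state example: action 0 stays, action 1 switches state,
# and a reward of 1 is collected whenever the current state is 1.
def toy_step(s, a, rng):
    return s if a == 0 else 1 - s


def toy_reward(s, a):
    return 1.0 if s == 1 else 0.0


q = empirical_qvi(toy_step, toy_reward, n_states=2, n_actions=2)
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

In this toy example the greedy policy moves from state 0 to state 1 and then stays there. With discount factor 0.9, the value of staying in state 1 is 1 / (1 - 0.9) = 10, which the iteration approaches geometrically.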
