Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
