Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles
