Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems 

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found