Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Dec-24-2025, 20:52:22 GMT–Neural Information Processing Systems

Policy optimization is a widely used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need for such assumptions, in this work we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state of the art.
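To make "adding dilated bonuses to the policy update" concrete, here is a minimal sketch. The abstract does not spell out the construction, so everything below is an assumption for illustration: a layered tabular MDP with known transitions, a dilation factor of 1 + 1/H for horizon H, and an exponential-weights update on the bonus-adjusted loss. The names `dilated_bonus`, `bonus_update`, `P`, `pi`, `b`, `q_hat_s`, and `eta` are hypothetical and not taken from the paper.

```python
import math


def dilated_bonus(P, pi, b):
    """Backward induction for a dilated bonus on a layered tabular MDP.

    P[h][s][a] : probabilities over the states of layer h + 1 (assumed known)
    pi[h][s]   : action probabilities in state s of layer h
    b[h][s][a] : local one-step exploration bonus
    """
    H = len(b)
    dilation = 1.0 + 1.0 / H  # assumed dilation factor, slightly above 1
    # Copy b so that the last layer satisfies B = b.
    B = [[list(row) for row in layer] for layer in b]
    for h in range(H - 2, -1, -1):
        for s in range(len(b[h])):
            for a in range(len(b[h][s])):
                # Expected next-layer bonus under P and the current policy.
                future = sum(
                    p_next
                    * sum(pa * Ba for pa, Ba in zip(pi[h + 1][s2], B[h + 1][s2]))
                    for s2, p_next in enumerate(P[h][s][a])
                )
                B[h][s][a] = b[h][s][a] + dilation * future
    return B


def bonus_update(pi_s, q_hat_s, B_s, eta):
    """One exponential-weights step on the bonus-adjusted loss q_hat - B."""
    w = [p * math.exp(-eta * (q - bonus)) for p, q, bonus in zip(pi_s, q_hat_s, B_s)]
    z = sum(w)
    return [x / z for x in w]


# Toy instance: two layers, one state per layer, two actions, uniform policy.
P = [[[[1.0], [1.0]]]]
pi = [[[0.5, 0.5]], [[0.5, 0.5]]]
b = [[[1.0, 1.0]], [[1.0, 1.0]]]
B = dilated_bonus(P, pi, b)
new_pi = bonus_update([0.5, 0.5], [0.0, 0.0], [1.0, 0.0], eta=0.1)
```

In this sketch the dilation factor slightly inflates bonuses that lie further in the future, and subtracting `B` from the loss estimate shifts probability toward actions with a larger bonus. In the bandit-feedback setting the abstract describes, the loss estimate and the transitions would themselves have to be estimated, which this toy example leaves out.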

adversarial mdp, improved exploration, policy optimization

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.59)