Policy Mirror Descent with Lookahead

Neural Information Processing Systems 

Policy Mirror Descent (PMD) stands as a versatile algorithmic framework encompassing several seminal policy gradient algorithms, such as natural policy gradient, and has connections to state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice, and remarkable empirical successes in RL such as AlphaGo and AlphaZero have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterparts. In this work, we propose a new class of PMD algorithms called h-PMD which incorporates multi-step greedy policy improvement with lookahead depth h into the PMD update rule.
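As a rough illustration of the idea described above (not the paper's implementation), the sketch below shows one h-PMD iteration on a known tabular MDP: evaluate the current policy, apply the Bellman optimality operator h-1 times to obtain depth-h lookahead Q-values, then perform a KL-regularized (multiplicative-weights) mirror descent update. The function name h_pmd_step and the exact evaluation and step-size choices are assumptions for this sketch.

```python
import numpy as np

def h_pmd_step(P, r, gamma, pi, eta, h):
    """One hypothetical h-PMD iteration on a tabular MDP.

    P: (S, A, S) transition tensor, r: (S, A) reward matrix,
    pi: (S, A) current stochastic policy, eta: step size,
    h: lookahead depth (h=1 recovers the standard PMD update).
    """
    S, A = r.shape
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for V^pi.
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    # Multi-step greedy lookahead: apply the Bellman optimality
    # operator h-1 times to V^pi, then form the depth-h Q-values.
    for _ in range(h - 1):
        Q = r + gamma * np.einsum('sap,p->sa', P, V)
        V = Q.max(axis=1)
    Q_h = r + gamma * np.einsum('sap,p->sa', P, V)
    # Mirror descent step in KL geometry: softmax of current
    # log-policy plus eta * lookahead Q-values, per state.
    logits = np.log(pi + 1e-12) + eta * Q_h
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

With h=1 the lookahead loop is skipped and Q_h equals Q^pi, so the update reduces to the usual soft policy iteration step; larger h trades extra per-iteration computation for a greedier improvement target.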
