Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.
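To ground the setting the abstract describes, here is a minimal, generic sketch of softmax policy optimization with linear function approximation. It is not the paper's algorithm (the contraction mechanism and regret analysis are not reproduced); the toy feature map, step size, and placeholder Q-estimate are illustrative assumptions only.

```python
# Generic sketch of softmax policy optimization with linear Q-estimates.
# NOT the paper's method; all names and the toy setup are assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, n_states, n_actions, H = 4, 6, 3, 5           # feature dim, |S|, |A|, horizon
phi = rng.normal(size=(n_states, n_actions, d))  # feature map phi(s, a) in R^d

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def estimate_q(w):
    """Linear Q-estimate Q_hat(s, a) = <phi(s, a), w_h>, one weight vector per step h."""
    return np.einsum("sad,hd->hsa", phi, w)

# Policy parameters: one logit table per step h (kept tabular here for clarity).
logits = np.zeros((H, n_states, n_actions))
eta = 0.5  # mirror-descent / exponential-weights step size

for epoch in range(10):
    # In an actual linear-MDP algorithm, w would come from regularized least
    # squares on collected trajectories; here a placeholder estimate is drawn.
    w = rng.normal(size=(H, d))
    q_hat = estimate_q(w)

    # Exponential-weights policy improvement, the core PO update:
    # pi_{t+1}(a | s) proportional to pi_t(a | s) * exp(eta * Q_hat(s, a)).
    logits += eta * q_hat
    pi = softmax(logits)
```

The update shown is the standard mirror-descent step shared by PO methods in this line of work; how the Q-estimates are built and stabilized (e.g., via a warm-up phase or, as this paper proposes, a contraction mechanism) is where the algorithms differ.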