IMED-RL: Regret optimal learning of ergodic Markov decision processes

Feb-9-2025, 23:38:53 GMT – Neural Information Processing Systems

We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on minimizing the regret with respect to an optimal policy when the learner knows neither the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB and KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and prohibitive computational complexity when implemented in practice. In this work, we shed new light on regret minimization strategies by extending the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm to reinforcement learning. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic.
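For readers unfamiliar with the bandit algorithm the abstract builds on, the following is a minimal sketch of the IMED index in the multi-armed bandit setting only, not of the IMED-RL extension the paper proposes. It assumes Bernoulli rewards, and the function names (`bernoulli_kl`, `imed_indices`, `run_imed`) and the clipping constant are illustrative choices, not taken from the paper. Each arm's index is its pull count times the KL divergence between its empirical mean and the best empirical mean, plus the log of its pull count; the arm with the smallest index is pulled.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def imed_indices(counts, means):
    """IMED index per arm: N_a * KL(mean_a, best empirical mean) + log(N_a)."""
    best = max(means)
    return [n * bernoulli_kl(m, best) + math.log(n) for n, m in zip(counts, means)]


def run_imed(arm_means, horizon, seed=0):
    """Simulate IMED on a Bernoulli bandit (assumed setting) and return pull counts."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so that all counts are positive
        else:
            means = [s / n for s, n in zip(sums, counts)]
            idx = imed_indices(counts, means)
            arm = min(range(k), key=idx.__getitem__)  # smallest index wins
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(run_imed([0.1, 0.9], horizon=2000))
```

Unlike UCB-style strategies, no optimization problem is solved per arm: the index is a closed-form function of empirical quantities, which is the sense in which the bandit version is computationally light.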

data mining, machine learning, reinforcement learning

Country:
- Europe > France (0.46)
- North America > United States (0.69)

Technology:
- Information Technology
  - Artificial Intelligence > Machine Learning
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.64)
    - Reinforcement Learning (0.89)
  - Data Science > Data Mining
    - Big Data (0.88)
