Navigating to the Best Policy in Markov Decision Processes

Jan-19-2025, 09:02:18 GMT–Neural Information Processing Systems

We investigate the classical active pure exploration problem in Markov Decision Processes (MDPs), where the agent sequentially selects actions and, from the resulting system trajectory, aims to identify the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least 1 - δ. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results for the generative setting [pmlr-v139-marjani21a], where the agent can at each step query the random outcome of any (state, action) pair.
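The difference between the two sampling models mentioned in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the three-state transition table `P`, the helper names `sample_generative` and `sample_navigation`, and the uniformly random action rule are hypothetical and are not taken from the paper. The sketch only shows which (state, action) pairs each model is allowed to sample; it does not implement the paper's algorithm or its stopping rule.

```python
import random

# Hypothetical toy MDP (not from the paper): P[s][a] lists next-state probabilities.
# Every state can reach every other state under some policy, so it is communicating.
P = {
    0: {0: [0.9, 0.1, 0.0], 1: [0.1, 0.8, 0.1]},
    1: {0: [0.5, 0.5, 0.0], 1: [0.0, 0.2, 0.8]},
    2: {0: [0.3, 0.0, 0.7], 1: [0.0, 0.6, 0.4]},
}


def step(P, s, a, rng):
    """Draw one next state from the distribution P[s][a]."""
    probs = P[s][a]
    return rng.choices(range(len(probs)), weights=probs)[0]


def sample_generative(P, pairs, rng):
    """Generative setting: any (state, action) pair may be queried, in any order."""
    return [(s, a, step(P, s, a, rng)) for s, a in pairs]


def sample_navigation(P, s0, choose_action, n_steps, rng):
    """Navigation setting: only the current state of the trajectory can be sampled."""
    s, transitions = s0, []
    for _ in range(n_steps):
        a = choose_action(s)
        s_next = step(P, s, a, rng)
        transitions.append((s, a, s_next))
        s = s_next  # the next query is forced to start where this one ended
    return transitions


rng = random.Random(0)
gen = sample_generative(P, [(2, 1), (0, 0), (1, 1)], rng)
nav = sample_navigation(P, 0, lambda s: rng.randrange(2), 20, rng)
```

In `gen`, the queried states jump freely (2, then 0, then 1). In `nav`, each transition starts in the state where the previous one ended, which is why the agent must plan how to reach the pairs it wants to sample.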

best policy, markov decision process, navigating

Technology:
- Information Technology
  - Artificial Intelligence > Machine Learning
    - Learning Graphical Models > Undirected Networks > Markov Models (0.85)
  - Decision Support Systems (0.70)