Finding good policies in average-reward Markov Decision Processes without prior knowledge

Neural Information Processing Systems 

We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H \leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires the knowledge of $H$.
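For reference, a sketch of the standard definitions of the two complexity measures from the average-reward literature (the abstract does not restate them; these are the usual definitions under a communicating-MDP assumption, not a quotation from the paper):

\[
  % Diameter: worst-case expected travel time between any two states,
  % minimized over stationary policies for each ordered pair (s, s').
  D \;=\; \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}^{\pi}\!\left[\tau(s \to s')\right]
\]
\[
  % Optimal bias span: the span seminorm of the optimal bias function h^*,
  % which, together with the optimal gain g^*, solves the average-reward
  % Bellman optimality equation.
  g^* + h^*(s) \;=\; \max_{a \in \mathcal{A}} \Big( r(s,a) + \sum_{s'} p(s' \mid s, a)\, h^*(s') \Big),
  \qquad
  H \;=\; \operatorname{sp}(h^*) \;=\; \max_{s} h^*(s) - \min_{s} h^*(s).
\]

Since the bias span only measures variation in long-run value across states while the diameter bounds travel times under an arbitrary reward structure, $H \leq D$ holds, and the gap can be large.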