Large Scale Markov Decision Processes with Changing Rewards

Oct-10-2024, 02:51:38 GMT–Neural Information Processing Systems

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O( \sqrt{\tau (\ln S \ln A)T}\ln(T)), where S is the state space, A is the action space, \tau is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in S and A . We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d\ll S, we propose a modified algorithm with a computational complexity polynomial in d and independent of S .

algorithm, markov decision process, scale markov decision process, (4 more...)

Neural Information Processing Systems

Oct-10-2024, 02:51:38 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.65)