Large Scale Markov Decision Processes with Changing Rewards
–Neural Information Processing Systems
We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of O( \sqrt{\tau (\ln S \ln A)T}\ln(T)), where S is the state space, A is the action space, \tau is the mixing time of the MDP, and T is the number of periods. The algorithm's computational complexity is polynomial in S and A . We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension d\ll S, we propose a modified algorithm with a computational complexity polynomial in d and independent of S .
Neural Information Processing Systems
Oct-10-2024, 02:51:38 GMT
- Technology: