Large Scale Markov Decision Processes with Changing Rewards

Neural Information Processing Systems

We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a regret bound of $O( \sqrt{\tau (\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with a computational complexity polynomial in $d$ and independent of $|S|$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound in the large-scale MDP setting with adversarially changing rewards.
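To make the notion of a state-action occupancy measure concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a tiny MDP with a known transition kernel and a fixed stationary policy, approximates the stationary occupancy measure by power iteration, and evaluates the long-run average reward as the inner product of the occupancy measure with a reward table. The names `stationary_occupancy`, `average_reward`, and the example numbers are hypothetical and chosen only for illustration.

```python
def stationary_occupancy(P, policy, iters=1000):
    """Approximate the stationary state-action occupancy measure mu[s][a].

    P[s][a][s2]  : known probability of moving from s to s2 under action a
    policy[s][a] : probability that the policy picks action a in state s
    Assumes the induced Markov chain is ergodic so power iteration converges.
    """
    n_states, n_actions = len(P), len(P[0])
    # State-to-state kernel induced by following the policy.
    K = [[sum(policy[s][a] * P[s][a][s2] for a in range(n_actions))
          for s2 in range(n_states)]
         for s in range(n_states)]
    # Power iteration on the state distribution, starting from uniform.
    nu = [1.0 / n_states] * n_states
    for _ in range(iters):
        nu = [sum(nu[s] * K[s][s2] for s in range(n_states))
              for s2 in range(n_states)]
    # Occupancy of (s, a) = stationary mass of s times policy probability of a.
    return [[nu[s] * policy[s][a] for a in range(n_actions)]
            for s in range(n_states)]


def average_reward(mu, r):
    """Long-run average reward as the inner product <mu, r>."""
    return sum(m * x for mu_s, r_s in zip(mu, r) for m, x in zip(mu_s, r_s))


# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.1, 0.9]],  # transitions from state 1 under actions 0, 1
]
policy = [[0.5, 0.5], [0.5, 0.5]]  # uniformly random policy
reward = [[1.0, 0.0], [0.0, 1.0]]  # one round's reward table r(s, a)

mu = stationary_occupancy(P, policy)
gain = average_reward(mu, reward)
```

Because the average reward is linear in the occupancy measure, choosing a policy can be recast as choosing a point in the set of valid occupancy measures, which is what makes online linear optimization tools applicable in this setting.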



Reviews: Large Scale Markov Decision Processes with Changing Rewards

I still feel that the work would be greatly improved by adding numerical experiments. In particular, the authors refer to a specific setting called 'online MDP', where the dynamics, that is, the transition probabilities, are known while the rewards are not. Regret minimization then refers to minimizing the regret when rewards may be chosen or observed in an adversarial manner. The authors start with a (rather technical) introduction, discuss related work, and explain the main ideas based on concise preliminaries. Afterwards, they provide an extension to large state spaces that uses approximate occupancy measures and thereby avoids concrete state mappings.
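For readers unfamiliar with the term, regret in the online MDP setting is commonly formalized along the following lines. This is the standard formulation from the online MDP literature, stated here as an assumption rather than quoted from the paper; the symbols $\Pi$ (the class of stationary policies), $r_t$ (the reward function in period $t$), and $(s_t^{\pi}, a_t^{\pi})$ (the trajectory under policy $\pi$) are introduced for illustration.

$$
R_T \;=\; \max_{\pi \in \Pi} \, \mathbb{E}\left[\sum_{t=1}^{T} r_t\big(s_t^{\pi}, a_t^{\pi}\big)\right] \;-\; \mathbb{E}\left[\sum_{t=1}^{T} r_t\big(s_t, a_t\big)\right]
$$

Here $(s_t, a_t)$ is the trajectory generated by the learner. A bound of order $\sqrt{T}$ on $R_T$ means the average per-period gap to the best fixed policy in hindsight vanishes as $T$ grows.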


Reviews: Large Scale Markov Decision Processes with Changing Rewards

The paper contributes new algorithmic ideas and theoretical results for regret minimization in Markov Decision Processes with known transition kernels but arbitrary cost functions. The reviewers broadly agree that the theoretical and algorithmic techniques introduced by the paper -- using the FTRL online learning idea and the extension to large MDPs via linear function approximation -- are novel, and thus the paper deserves to be published; however, the known-MDP-unknown-cost setting may be somewhat narrow in its applicability in practice.
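As a rough illustration of the FTRL (Follow the Regularized Leader) idea mentioned above, here is a minimal sketch under simplifying assumptions. It runs FTRL with a negative-entropy regularizer over the probability simplex, where the update has the closed-form exponential-weights solution. The paper applies the idea over occupancy measures, which additionally requires handling the MDP's flow constraints; that part is not shown. The function names, the reward sequence, and the step size `eta` are hypothetical.

```python
import math


def ftrl_entropy_step(cumulative_reward, eta):
    """FTRL play for the negative-entropy regularizer on the simplex.

    Maximizing eta * <x, cumulative_reward> + entropy(x) over the simplex
    gives x proportional to exp(eta * cumulative_reward).
    """
    top = max(cumulative_reward)  # subtract the max for numerical stability
    weights = [math.exp(eta * (g - top)) for g in cumulative_reward]
    total = sum(weights)
    return [w / total for w in weights]


def run_ftrl(reward_sequence, eta):
    """Play FTRL against a reward sequence; return the last play and the regret."""
    n = len(reward_sequence[0])
    cumulative = [0.0] * n
    earned = 0.0
    x = ftrl_entropy_step(cumulative, eta)
    for r in reward_sequence:
        # The play depends only on rewards observed in earlier rounds.
        x = ftrl_entropy_step(cumulative, eta)
        earned += sum(xi * ri for xi, ri in zip(x, r))
        cumulative = [c + ri for c, ri in zip(cumulative, r)]
    best_fixed = max(cumulative)  # best single choice in hindsight
    return x, best_fixed - earned


# Hypothetical reward sequence with values in [0, 1] over three choices.
T = 200
rewards = [[1.0, 0.0, 0.5] if t % 3 else [0.0, 1.0, 0.5] for t in range(T)]
eta = math.sqrt(8.0 * math.log(3) / T)
last_play, regret = run_ftrl(rewards, eta)
```

With rewards in $[0, 1]$ and $n$ choices, the classical exponential-weights analysis bounds the regret by $\ln(n)/\eta + \eta T/8$, which is of order $\sqrt{T \ln n}$ for the step size used here.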

