Sublinear Regret for Learning POMDPs
Xiong, Yi, Chen, Ningyuan, Gao, Xuefeng, Zhou, Xiang
–arXiv.org Artificial Intelligence
We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
arXiv.org Artificial Intelligence
Jul-16-2022
- Country:
- North America
- United States > New York (0.04)
- Canada > Ontario
- Toronto (0.14)
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.04)
- Asia
- North America
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.34)
- Technology: