Sublinear Regret for Learning POMDPs

Xiong, Yi, Chen, Ningyuan, Gao, Xuefeng, Zhou, Xiang

Jul-16-2022–arXiv.org Artificial Intelligence

We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.

algorithm, gao and zhou, pomdp, (15 more...)

arXiv.org Artificial Intelligence

Jul-16-2022

arXiv.org PDF

Add feedback

Country:
- North America
  - United States > New York (0.04)
  - Canada > Ontario
    - Toronto (0.14)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East > Jordan (0.04)
  - Japan > Honshū
    - Tōhoku (0.04)

Genre:
- Research Report (1.00)

Industry:
- Education (0.34)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found