Reviews: PAC Reinforcement Learning with Rich Observations
Neural Information Processing Systems
Contextual MDPs are a restricted class of POMDPs in which the optimal Q-function depends only on the most recent observation rather than on the belief state. The authors show that contextual MDPs are not PAC-learnable in a polynomial number of episodes even when either memoryless policies are considered or value-function approximation is used. However, when both memoryless policies and value-function approximation are used and the transitions are deterministic, the model is PAC-learnable in a polynomial number of episodes, with sample complexity independent of the number of observations. The paper is well written overall, and the proofs are clear and thorough. I am not sure, however, that the 16 pages of technical proofs in the appendix are suitable for a conference; the paper may be a better fit for a journal format.
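To make the setting concrete, here is a minimal sketch (my own illustration, not from the paper) of a deterministic-transition contextual MDP: hidden states evolve deterministically, the learner only ever sees rich observations drawn from a huge observation space, and a memoryless policy that maps the current observation to an action suffices to act optimally. All names (`GOOD`, `run_episode`, the combination-lock reward structure) are hypothetical.

```python
import random

HORIZON = 3
GOOD = [1, 0, 1]  # the single rewarding action sequence (illustrative)

def step(state, action):
    """Deterministic hidden-state transition; state = (level, on_path)."""
    level, on_path = state
    return (level + 1, on_path and action == GOOD[level])

def observe(state, rng):
    """Emit a rich observation: many possible observations per hidden state,
    but each observation is consistent with exactly one hidden state."""
    return (state, rng.randrange(10**6))

def run_episode(policy, seed=0):
    """Run one episode; the policy sees only observations, never the state."""
    rng = random.Random(seed)
    state = (0, True)
    for _ in range(HORIZON):
        obs = observe(state, rng)
        state = step(state, policy(obs))
    return 1.0 if state[1] else 0.0  # reward only at the end of the episode

def optimal_policy(obs):
    """Memoryless: the action depends only on the current observation."""
    (level, _), _noise = obs
    return GOOD[level]
```

Because the observation identifies the hidden state, the optimal Q-function is a function of the current observation alone, which is exactly the restriction that distinguishes contextual MDPs from general POMDPs.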