ρ-POMDPs have Lipschitz-Continuous ε-Optimal Value Functions
Mathieu Fehr, Olivier Buffet, Vincent Thomas, Jilles Dibangoye
Neural Information Processing Systems
Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a "fully observable" problem--a belief MDP--and exploiting the piecewise linearity and convexity (PWLC) of the optimal value function in this new state space (the belief simplex). This approach has been extended to solving ρ-POMDPs--i.e., POMDPs with information-oriented criteria--when the reward ρ is convex in the belief. General ρ-POMDPs can also be turned into "fully observable" problems, but with no means to exploit the PWLC property. In this paper, we focus on POMDPs and ρ-POMDPs with λ_ρ-Lipschitz reward functions, and demonstrate that, for finite horizons, the optimal value function is Lipschitz-continuous. Then, value function approximators are proposed for both upper- and lower-bounding the optimal value function, which are shown to provide uniformly improvable bounds.
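The Lipschitz property directly yields point-based bounds: if the optimal value is known (or bounded) at sampled beliefs b_i, then any λ-Lipschitz value function lies below the minimum of the downward cones v_i + λ·d(b, b_i) and above the maximum of the upward cones v_i - λ·d(b, b_i). The sketch below illustrates such cone-based approximators; it assumes a 1-norm metric on the belief simplex and uses hypothetical sample points, so it conveys the general idea rather than the paper's exact construction.

```python
import numpy as np

def lipschitz_bounds(b, samples, values, lam):
    """Cone-based bounds on a lam-Lipschitz value function.

    b       : query belief, shape (n_states,)
    samples : sampled beliefs, shape (k, n_states)
    values  : value estimates at the sampled beliefs, shape (k,)
    lam     : Lipschitz constant (e.g., derived from the reward's λ_ρ)

    Returns (lower, upper) with lower <= V(b) <= upper for any
    lam-Lipschitz V that agrees with `values` at `samples`.
    """
    d = np.abs(samples - b).sum(axis=1)   # 1-norm distances to each sample
    upper = np.min(values + lam * d)      # tightest min of downward cones
    lower = np.max(values - lam * d)      # tightest max of upward cones
    return lower, upper

# Toy usage on a 2-state belief simplex (hypothetical numbers):
samples = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
values = np.array([0.0, 1.0, 0.0])
lower, upper = lipschitz_bounds(np.array([0.25, 0.75]), samples, values, lam=2.0)
print(lower, upper)  # 0.0 1.0 -- the bounds bracket the true value
```

Adding a new sampled belief can only tighten these envelopes, which is the sense in which such bounds are uniformly improvable; the cones play a role analogous to the α-vectors and sawtooth bounds used by PWLC-based point-based solvers.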