Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

arXiv.org Artificial Intelligence 

Reinforcement learning (RL) in partially observable systems is a challenging problem. While the partially observable Markov decision process (POMDP) is a versatile framework, POMDPs are generally hard to learn, primarily because the optimal policy depends on the entire history of the process [40, 28]. Given this fundamental hardness, it is important to consider sub-classes of POMDPs that allow tractable solutions for a variety of applications. We are interested in a special and prevalent sub-class of POMDPs where the latent (unobservable) parts of the system remain static in each episode. Specifically, we consider the framework of Latent MDPs (LMDPs), which has been studied in several works (e.g., [8, 5, 22, 41, 30]). In LMDPs, one MDP is randomly chosen from M possible candidate models at the beginning of every episode, and an agent interacts with the chosen MDP for the H time steps of that episode. However, the identity of the chosen MDP, which we call the latent context, is unknown to the agent. To learn near-optimal policies with latent contexts, existing POMDP solutions would require strong assumptions on reachability of the system (e.g., [2, 21]) or certain separability assumptions (e.g., see conditions

Most work is done while the author is at The University of Texas at Austin.
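The episodic protocol described above (one of M candidate MDPs drawn at the start of each episode, followed by H steps of interaction with the context hidden from the agent) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `TabularMDP` container, the `run_episode` helper, the Bernoulli reward model, the mixing `weights`, and the fixed initial state.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One observed step: (state, action, reward).
Step = Tuple[int, int, float]


@dataclass
class TabularMDP:
    # transition[s][a] is a list of next-state probabilities (hypothetical layout).
    transition: List[List[List[float]]]
    # reward[s][a] is the mean of a Bernoulli reward (an assumed reward model).
    reward: List[List[float]]


def run_episode(
    mdps: List[TabularMDP],
    weights: List[float],
    policy: Callable[[List[Step], int], int],
    horizon: int,
    rng: random.Random,
    initial_state: int = 0,
) -> Tuple[List[Step], int]:
    """Simulate one LMDP episode; the latent context is never shown to the policy."""
    # The latent context is drawn once and stays fixed for the whole episode.
    context = rng.choices(range(len(mdps)), weights=weights, k=1)[0]
    mdp = mdps[context]

    state = initial_state
    history: List[Step] = []
    for _ in range(horizon):
        # The policy sees only the observed history and current state.
        action = policy(history, state)
        r = 1.0 if rng.random() < mdp.reward[state][action] else 0.0
        probs = mdp.transition[state][action]
        next_state = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        history.append((state, action, r))
        state = next_state

    # The context is returned for inspection only; a learner would not observe it.
    return history, context
```

The policy takes the full history as an argument because, as noted above, the optimal policy in a partially observable system can depend on everything observed so far in the episode.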
