RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation

Neural Information Processing Systems 

We introduce the first sample-efficient algorithm for LMDPs without any additional distributional assumptions. Our result builds on a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective that has been overlooked in the context of exploration in partially observed environments.
