RL for Latent MDPs: Regret Guarantees and a Lower Bound
Neural Information Processing Systems
In this work, we consider the regret minimization problem for reinforcement learning in latent Markov decision processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the system dynamics of the constituent MDPs.
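To make the interaction protocol concrete, here is a minimal sketch of an LMDP episode in a toy tabular setting. All names (`TabularMDP`, `run_episode`) are illustrative and not from the paper; the essential point is that nature draws one of the M MDPs at the start of each episode, while the agent observes only states and rewards, never the index of the drawn MDP.

```python
import random

class TabularMDP:
    """A toy tabular MDP: transitions[s][a] is a list of next-state
    probabilities, rewards[s][a] is a scalar reward (illustrative only)."""
    def __init__(self, n_states, n_actions, transitions, rewards):
        self.n_states = n_states
        self.n_actions = n_actions
        self.transitions = transitions
        self.rewards = rewards

def run_episode(mdps, policy, horizon, rng):
    """One LMDP episode: an MDP is drawn uniformly from the M candidates,
    but the agent's policy only conditions on the observed state."""
    mdp = rng.choice(mdps)  # latent draw, hidden from the agent
    s, total = 0, 0.0
    for _ in range(horizon):
        a = policy(s)
        total += mdp.rewards[s][a]
        s = rng.choices(range(mdp.n_states),
                        weights=mdp.transitions[s][a])[0]
    return total

# Usage: two 2-state, 2-action MDPs that differ only in their rewards.
mdp1 = TabularMDP(2, 2,
                  [[[1, 0], [0, 1]], [[1, 0], [0, 1]]],
                  [[0, 1], [1, 0]])
mdp2 = TabularMDP(2, 2,
                  [[[1, 0], [0, 1]], [[1, 0], [0, 1]]],
                  [[1, 0], [0, 1]])
rng = random.Random(0)
total = run_episode([mdp1, mdp2], policy=lambda s: 0, horizon=5, rng=rng)
```

Because the latent index is hidden, the agent's per-episode return depends on which MDP was drawn, which is exactly what makes regret minimization in LMDPs harder than in a single MDP.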