Tractable Optimality in Episodic Latent MABs
Neural Information Processing Systems
We consider a multi-armed bandit problem with A actions and M latent contexts, where an agent interacts with the environment for an episode of H time steps. Depending on the length of the episode, the learner may not be able to accurately estimate the latent context. The resulting partial observability of the environment makes the learning task significantly more challenging.
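To make the setting concrete, the following is a minimal simulation sketch, not the paper's method. It assumes Bernoulli rewards and a latent context drawn uniformly at the start of each episode; neither assumption is stated in the abstract. The names `EpisodicLatentMAB`, `run_episode`, and `means` are hypothetical and chosen for illustration only.

```python
import random


class EpisodicLatentMAB:
    """Toy episodic latent bandit with M hidden contexts and A actions."""

    def __init__(self, means, horizon, seed=0):
        # means[m][a]: reward mean of action a under context m.
        # Bernoulli rewards are an assumption made for this sketch.
        self.means = means
        self.M = len(means)
        self.A = len(means[0])
        self.H = horizon
        self.rng = random.Random(seed)
        self._context = None
        self._t = 0

    def reset(self):
        # Draw a fresh latent context (uniformly, by assumption).
        # The learner never observes it directly.
        self._context = self.rng.randrange(self.M)
        self._t = 0

    def step(self, action):
        # Return a 0/1 reward; refuse to run past the episode length H.
        if self._context is None or self._t >= self.H:
            raise RuntimeError("call reset() to start a new episode")
        self._t += 1
        p = self.means[self._context][action]
        return 1 if self.rng.random() < p else 0


def run_episode(env, policy):
    # policy maps the in-episode history of (action, reward) pairs to an action.
    env.reset()
    history = []
    for _ in range(env.H):
        action = policy(history)
        reward = env.step(action)
        history.append((action, reward))
    return history
```

With a short horizon, the history of at most H (action, reward) pairs may be too small to tell the contexts apart, which is the partial observability the abstract refers to. For example, `run_episode(EpisodicLatentMAB([[0.9, 0.1], [0.1, 0.9]], horizon=5), lambda h: 0)` returns five pairs from a single hidden context.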
Mar-27-2025, 07:35:59 GMT