Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Neural Information Processing Systems 

We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transitions (Foster et al., 2022). First, we improve the poly(d, A, H)T {5/6} regret bound of Zhao et al. (2024) to poly(d, A, H)T {2/3} for the full-information unknown transition setting, where d is the rank of the transitions, A is the number of actions, H is the horizon length, and T is the number of episodes. Next, we initiate the study on the setting with bandit loss feedback and unknown transitions. Assuming that the loss has a linear structure, we propose both model-based and model-free algorithms achieving poly(d, A, H)T {2/3} regret, though they are computationally inefficient.