Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback
–Neural Information Processing Systems
In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the \widetilde{O}(K {\frac{5}{6}}A {\frac{1}{2}}d\ln(1 M)/(1-\gamma) 2) regret guarantee, where d is rank of the transition kernel (and hence the dimension of the unknown representations), A is the cardinality of the action space, M is the cardinality of the model class that contains all the plausible representations, and \gamma is the discounted factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of potentially arbitrarily large state space. Furthermore, we also prove an \Omega(\frac{\gamma 2}{1-\gamma} \sqrt{d A K}) regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting.