Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback

Jan-19-2025, 20:20:52 GMT–Neural Information Processing Systems

In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the \widetilde{O}(K {\frac{5}{6}}A {\frac{1}{2}}d\ln(1 M)/(1-\gamma) 2) regret guarantee, where d is rank of the transition kernel (and hence the dimension of the unknown representations), A is the cardinality of the action space, M is the cardinality of the model class that contains all the plausible representations, and \gamma is the discounted factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of potentially arbitrarily large state space. Furthermore, we also prove an \Omega(\frac{\gamma 2}{1-\gamma} \sqrt{d A K}) regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting.

adversarial low-rank markov decision process, regret guarantee, unknown transition and full-information feedback, (7 more...)

Neural Information Processing Systems

Jan-19-2025, 20:20:52 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.79)
  - Machine Learning > Learning Graphical Models
    - Undirected Networks > Markov Models (0.40)