Dynamic Regret of Adversarial Linear Mixture MDPs

Jan-19-2025, 21:09:46 GMT–Neural Information Processing Systems

We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and the unknown transition kernel. We consider the linear mixture MDPs whose transition kernel is a linear mixture model and choose the \emph{dynamic regret} as the performance measure. Denote by d the dimension of the feature mapping, H the horizon, K the number of episodes, P_T the non-stationary measure, we propose a novel algorithm that enjoys an \widetilde{\mathcal{O}}\big(\sqrt{d 2 H 3K} \sqrt{H 4(K P_T)(1 P_T)}\big) dynamic regret under the condition that P_T is known, which improves previously best-known dynamic regret for adversarial linear mixture MDP and adversarial tabular MDPs. We also establish an \Omega\big(\sqrt{d 2 H 3 K} \sqrt{H K (H P_T)}\big) lower bound, indicating our algorithm is \emph{optimal} in K and P_T . Furthermore, when the non-stationary measure P_T is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an \widetilde{\mathcal{O}}\big(\sqrt{d 2 H 3K} \sqrt{H 4(K P_T)(1 P_T) H 2 S_T 2}\big) dynamic regret and here S_T is the expected switching number of the best base-learner.

adversarial linear mixture mdp, dynamic regret, sqrt, (6 more...)

Neural Information Processing Systems

Jan-19-2025, 21:09:46 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.42)