Dynamic Regret of Adversarial Linear Mixture MDPs
–Neural Information Processing Systems
We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and the unknown transition kernel. We consider the linear mixture MDPs whose transition kernel is a linear mixture model and choose the \emph{dynamic regret} as the performance measure. Denote by d the dimension of the feature mapping, H the horizon, K the number of episodes, P_T the non-stationary measure, we propose a novel algorithm that enjoys an \widetilde{\mathcal{O}}\big(\sqrt{d 2 H 3K} \sqrt{H 4(K P_T)(1 P_T)}\big) dynamic regret under the condition that P_T is known, which improves previously best-known dynamic regret for adversarial linear mixture MDP and adversarial tabular MDPs. We also establish an \Omega\big(\sqrt{d 2 H 3 K} \sqrt{H K (H P_T)}\big) lower bound, indicating our algorithm is \emph{optimal} in K and P_T . Furthermore, when the non-stationary measure P_T is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an \widetilde{\mathcal{O}}\big(\sqrt{d 2 H 3K} \sqrt{H 4(K P_T)(1 P_T) H 2 S_T 2}\big) dynamic regret and here S_T is the expected switching number of the best base-learner.
Neural Information Processing Systems
Jan-19-2025, 21:09:46 GMT
- Technology: