Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Chen, Zijun, Zhang, Zihan

arXiv.org Machine Learning 

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{π^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/ε^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{Δ_{\min}^{p}} + pKΔ_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $Δ_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found