concentrability
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
Zhao, Qingyue, Ji, Kaixuan, Zhao, Heyang, Gu, Quanquan
\emph{Kullback-Leibler} (KL) regularization is ubiquitous in reinforcement learning algorithms in the form of \emph{reverse} or \emph{forward} KL. Recent studies have demonstrated $ฮต^{-1}$-type fast rates for decision making under reverse KL regularization, in contrast to the standard $ฮต^{-2}$-type sample complexity. However, for forward-KL-regularized objectives, existing statistical analyses are either not applicable or result in $\tilde{O}(ฮต^{-2})$ slow rates. We take the first step towards addressing this problem via a streamlined analysis of forward-KL-regularized offline CBs. We give the first $\tilde{O}(ฮต^{-1})$ upper bounds in tabular and general function approximation settings, both under notions of \emph{single-policy concentrability}. In particular, our convex-analytical pipeline unifies these settings by exploiting the pessimism principle in a novel way and completely bypasses the proof routines in previous works based on the mean value theorem, which might be of independent interest. Moreover, we provide rate-optimal lower bounds, manifesting the tightness of our upper bounds in terms of statistical rates. Our lower bounds also demonstrate that the forward-KL-regularized sample complexity recovers the unregularized slow rate in the low-regularization regime, similarly to the reverse-KL regularization.
On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Ji, Kaixuan, Di, Qiwei, Zhao, Heyang, Zhao, Qingyue, Gu, Quanquan
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(ฮทSAC^{ฯ^*}/ฮต)$ under large regularization $ฮท= \tilde{O}(ฮต^{-1})$, and a sample complexity of $\tildeฮฉ(SAC^{ฯ^*}/ฮต^2)$ under small regularization $ฮท= \tildeฮฉ(ฮต^{-1})$, where $ฮท$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{ฯ^*}$ policy coverage coefficient at the optimal policy $ฯ^*$, $ฮต$ is the desired sub-optimality, and $\tilde{O}$ and $\tildeฮฉ$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.