suboptimality gap
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness of KL-regularization has been empirically demonstrated in various practical scenarios, current theoretical analyses of KL-regularized RLHF still yield the same O(1/ฯต2) sample complexity as ones without KL-regularization. To understand the fundamental distinction between objectives with KL-regularization and ones without KLregularization, we are the first to theoretically demonstrate the power of KLregularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an O(1/ฯต) sample complexity when ฯต is sufficiently small. We also prove matching lower bounds for both settings. More specifically, we study how the coverage of the reference policy affects the sample complexity of KL-regularized online contextual bandits and RLHF. We show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling algorithm can achieve an O(1/ฯต) sample complexity with only an additive dependence on the coverage coefficient, thus proving the benefits of online data even without explicit exploration. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in online decision making, shedding light on the design of more efficient algorithms.
On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Ji, Kaixuan, Di, Qiwei, Zhao, Heyang, Zhao, Qingyue, Gu, Quanquan
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(ฮทSAC^{ฯ^*}/ฮต)$ under large regularization $ฮท= \tilde{O}(ฮต^{-1})$, and a sample complexity of $\tildeฮฉ(SAC^{ฯ^*}/ฮต^2)$ under small regularization $ฮท= \tildeฮฉ(ฮต^{-1})$, where $ฮท$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{ฯ^*}$ policy coverage coefficient at the optimal policy $ฯ^*$, $ฮต$ is the desired sub-optimality, and $\tilde{O}$ and $\tildeฮฉ$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
Cho, Young Hyun, Sun, Will Wei
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
Achieving Constant Regret in Linear Markov Decision Processes
We study the constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspec-ified linear Markov decision processes (MDPs) where both the transition kernel and the reward function can be approximated by some linear function up to mis-specification level ฮถ . At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant w.r.t. the number of episodes.