suboptimality gap
On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
Ji, Kaixuan, Di, Qiwei, Zhao, Heyang, Zhao, Qingyue, Gu, Quanquan
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(ฮทSAC^{ฯ^*}/ฮต)$ under large regularization $ฮท= \tilde{O}(ฮต^{-1})$, and a sample complexity of $\tildeฮฉ(SAC^{ฯ^*}/ฮต^2)$ under small regularization $ฮท= \tildeฮฉ(ฮต^{-1})$, where $ฮท$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{ฯ^*}$ policy coverage coefficient at the optimal policy $ฯ^*$, $ฮต$ is the desired sub-optimality, and $\tilde{O}$ and $\tildeฮฉ$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
Cho, Young Hyun, Sun, Will Wei
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
Achieving Constant Regret in Linear Markov Decision Processes
We study the constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspec-ified linear Markov decision processes (MDPs) where both the transition kernel and the reward function can be approximated by some linear function up to mis-specification level ฮถ . At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant w.r.t. the number of episodes.