suboptimality gap
Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
Cho, Young Hyun, Sun, Will Wei
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
- North America > United States (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > United States > California (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- North America > United States > Ohio (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Austria (0.04)
- (5 more...)
Achieving Constant Regret in Linear Markov Decision Processes
We study the constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspec-ified linear Markov decision processes (MDPs) where both the transition kernel and the reward function can be approximated by some linear function up to mis-specification level ζ . At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant w.r.t. the number of episodes.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Texas > Brazos County > College Station (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Singapore (0.04)
- North America > Canada (0.04)
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- North America > United States (0.15)
- Asia > Singapore (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
- (2 more...)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Texas (0.04)
- (4 more...)