Avoiding exp(R) scaling in RLHF through Preference-based Exploration
–Neural Information Processing Systems
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focuses on improving its sample efficiency. All existing algorithms for online RLHF, whether they explore passively or actively, suffer from a sample complexity that scales exponentially with the range of the reward function. This statistical inefficiency hinders their effectiveness in scenarios with heavily skewed preferences, e.g.
Jun-14-2026, 07:13:01 GMT