Avoiding exp(R) scaling in RLHF through Preference-based Exploration

Jun-14-2026, 07:13:01 GMT–Neural Information Processing Systems

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the online RLHF setting and focuses on improving its sample efficiency. All existing algorithms for online RLHF, whether they explore passively or actively, suffer from a sample complexity that scales exponentially with the range of the reward function. This statistical inefficiency hinders their effectiveness in scenarios with heavily skewed preferences, e.g.


