Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) learns directly from rewards.
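To make the contrast concrete, a common (though not necessarily the one adopted here) formalization of preference signals is the Bradley–Terry model, where a latent reward function $r$ induces preferences between two trajectories $\tau_1, \tau_2$:

```latex
% Bradley--Terry preference model (illustrative; symbols assumed):
% r(\tau) is a latent trajectory reward, \sigma the logistic function.
P(\tau_1 \succ \tau_2) \;=\; \sigma\bigl(r(\tau_1) - r(\tau_2)\bigr)
\;=\; \frac{\exp\bigl(r(\tau_1)\bigr)}{\exp\bigl(r(\tau_1)\bigr) + \exp\bigl(r(\tau_2)\bigr)}
```

Standard RL observes $r$ directly, whereas RLHF only observes binary comparison outcomes drawn from such a preference distribution.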
–Neural Information Processing Systems
The latter case can be further reduced to an adversarial MDP when preferences depend only on the final state.