Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) learns directly from reward.

Neural Information Processing Systems 

The latter case can be further reduced to an adversarial MDP when preferences depend only on the final state.
