Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Neural Information Processing Systems 

The second step is reward modeling, which is the origin of the name "reward-based".
