Multi-turn Reinforcement Learning from Preference Human Feedback

Neural Information Processing Systems 

In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to a Nash equilibrium.
