Multi-turn Reinforcement Learning from Preference Human Feedback
Neural Information Processing Systems
In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to a Nash equilibrium.
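To make the idea of mirror-descent policy optimization under preference feedback concrete, here is a minimal sketch. It is not the paper's multi-turn algorithm: it assumes a single state (a one-turn special case), a hypothetical 3-action preference matrix `PREF`, and a plain entropic mirror-descent (multiplicative-weights) update in self-play. All names and numbers are illustrative assumptions.

```python
import math

# Hypothetical 3-action preference matrix (an assumption for illustration):
# PREF[a][b] is the probability that action a is preferred over action b,
# so PREF[a][b] + PREF[b][a] == 1 and PREF[a][a] == 0.5.
PREF = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]


def win_rates(pref, policy):
    """Probability that each action is preferred over a sample from `policy`."""
    return [sum(p_ab * pi_b for p_ab, pi_b in zip(row, policy)) for row in pref]


def mirror_descent_step(pref, policy, step_size):
    """One entropic mirror-descent (multiplicative-weights) self-play update."""
    rates = win_rates(pref, policy)
    weights = [pi_a * math.exp(step_size * r) for pi_a, r in zip(policy, rates)]
    total = sum(weights)
    return [w / total for w in weights]


def exploitability(pref, policy):
    """How far the best response to `policy` exceeds an even 0.5 win rate."""
    return max(win_rates(pref, policy)) - 0.5


def run_self_play(pref, steps=200, step_size=1.0):
    """Start from the uniform policy and repeat the self-play update."""
    n = len(pref)
    policy = [1.0 / n] * n
    for _ in range(steps):
        policy = mirror_descent_step(pref, policy, step_size)
    return policy


if __name__ == "__main__":
    final_policy = run_self_play(PREF)
    print("policy:", [round(p, 4) for p in final_policy])
    print("exploitability:", round(exploitability(PREF, final_policy), 6))
```

In this toy matrix, action 0 is preferred over every other action, so the iterates concentrate on it and the exploitability shrinks toward zero. Without such a dominant action, plain self-play iterates can cycle, which is one reason regularized or averaged variants are commonly used.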
Feb-18-2026, 08:02:43 GMT
- Country:
  - Asia
    - China (0.04)
    - Middle East > Israel
      - Tel Aviv District > Tel Aviv (0.04)
    - Russia (0.04)
  - Europe
    - Austria (0.04)
    - France > Île-de-France
    - Germany (0.04)
    - Hungary (0.04)
    - Russia (0.04)
    - United Kingdom (0.04)
  - North America > United States (0.14)
  - South America > Peru (0.04)
- Genre:
  - Research Report > Experimental Study (1.00)
- Industry:
  - Education > Educational Setting > Online (0.46)