Multi-turn Reinforcement Learning from Preference Human Feedback

Feb-18-2026, 08:02:43 GMT–Neural Information Processing Systems

In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to a Nash equilibrium.
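To make the idea concrete, here is a minimal sketch of a mirror-descent (multiplicative-weights) self-play update. It is a single-state simplification and not the paper's multi-turn algorithm. The preference matrix `PREF`, the step size `eta`, the step count, and the function name `mirror_descent_self_play` are all illustrative assumptions, not values or names from the paper.

```python
import numpy as np


def mirror_descent_self_play(pref, eta=0.5, steps=200):
    """Self-play mirror descent with a KL (entropic) mirror map.

    pref[a, b] is the assumed probability that action a is preferred
    over action b, with pref[a, b] + pref[b, a] == 1.
    """
    n = pref.shape[0]
    log_pi = np.full(n, -np.log(n))  # start from the uniform policy
    for _ in range(steps):
        pi = np.exp(log_pi)
        # win_prob[a]: probability that a beats an action drawn from pi
        win_prob = pref @ pi
        # Entropic mirror-descent step, computed in log space
        log_pi = log_pi + eta * win_prob
        # Renormalize so the policy sums to one
        log_pi -= np.logaddexp.reduce(log_pi)
    return np.exp(log_pi)


# Hypothetical 3-action preference matrix; action 0 beats every other action.
PREF = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])

policy = mirror_descent_self_play(PREF)
print(policy.round(4))
```

In this toy game action 0 is preferred over every alternative, so the policy that always plays it is the Nash equilibrium, and the iterates concentrate on it. General preference games can have mixed equilibria, where plain self-play of this kind may need averaging or regularization to converge.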

large language model, machine learning, reinforcement learning

Country:
- North America > United States (0.14)
- South America > Peru (0.04)
- Europe
  - Russia (0.04)
  - Hungary (0.04)
  - Austria (0.04)
  - United Kingdom (0.04)
  - Germany (0.04)
  - France > Île-de-France
    - Paris > Paris (0.04)
- Asia
  - Russia (0.04)
  - China (0.04)
  - Middle East > Israel
    - Tel Aviv District > Tel Aviv (0.04)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Education > Educational Setting > Online (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (1.00)
