Efficient RL for optimizing conversation level outcomes with an LLM-based tutor
Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar
–arXiv.org Artificial Intelligence
Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate, turn-level human preferences. However, this approach falls short in multi-turn dialogue settings such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of the student and optimizing a long-term policy that determines high-level actions based on that latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student toward solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.
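The pipeline the abstract describes, compressing dialogue history into a low-dimensional latent student state and then choosing a high-level tutoring action with a long-horizon policy, can be sketched as follows. This is an illustrative toy, not the authors' implementation: the feature names, action set, and heuristic rules are all assumptions standing in for learned components.

```python
# Hypothetical sketch: latent student state + high-level action policy.
# In the paper both the encoder and the policy would be learned; here they
# are hand-written stand-ins to show the interface, not the actual method.
from dataclasses import dataclass

@dataclass
class LatentStudentState:
    # Illustrative low-dimensional features summarizing the dialogue so far.
    progress: float     # estimated fraction of the problem solved, in [0, 1]
    frustration: float  # estimated frustration level, in [0, 1]

# Assumed high-level action space (the paper's concrete actions may differ).
HIGH_LEVEL_ACTIONS = ["give_hint", "ask_guiding_question", "encourage", "confirm_step"]

def encode_history(turns: list[str]) -> LatentStudentState:
    """Toy encoder: a learned mapping from dialogue history to a latent
    state would go here; we use crude keyword heuristics instead."""
    text = " ".join(turns).lower()
    progress = min(1.0, sum(kw in text for kw in ("so", "then", "equals")) / 3)
    frustration = min(1.0, text.count("stuck") / 2 + text.count("confused") / 2)
    return LatentStudentState(progress, frustration)

def policy(state: LatentStudentState) -> str:
    """Toy long-horizon policy over high-level actions; a trained RL policy
    optimizing conversation-level outcomes would replace these rules."""
    if state.frustration > 0.5:
        return "encourage"
    if state.progress < 0.3:
        return "ask_guiding_question"
    return "confirm_step"

# The chosen high-level action would then condition the LLM's next utterance,
# keeping the expensive LLM out of the RL training loop.
action = policy(encode_history(["I'm stuck on this equation", "I'm confused"]))
```

Because the policy operates on a small latent state rather than raw text, it can be trained cheaply and swapped in without fine-tuning the underlying LLM, which matches the paper's stated efficiency motivation.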
Jul-23-2025
- Genre:
- Research Report > New Finding (0.87)
- Industry:
- Education
- Educational Setting > K-12 Education (0.49)
- Educational Technology (0.46)