Efficient RL for optimizing conversation level outcomes with an LLM-based tutor