Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding
arXiv.org Artificial Intelligence
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners--a critical capability in high-stakes domains--remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask on a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our deployed model achieved performance superior even to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.

Across industries such as healthcare, law, and finance, numerous goal-oriented conversations take place every day between human experts and their clients (Wang et al., 2025; Yang et al., 2023).
This vast corpus of dialogue data represents a largely untapped goldmine, containing implicit expert-driven strategies for navigating complex, information-seeking scenarios. While organizations possess these valuable data assets, Large Language Models (LLMs) are seldom trained to harness them effectively. Instead, their default behavior remains largely passive, limiting their potential as truly collaborative and proactive partners. In high-stakes domains, this passivity is a critical failure: an intelligent LLM application should not merely answer questions but proactively form a policy to gather information and drive the conversation towards a designated goal. Two main paradigms have emerged to instill such proactivity, yet both struggle with a significant "reality gap". The first optimizes for local, single-turn attributes and fails to learn a coherent, sequential policy that accounts for temporal dependencies in a conversation.
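The core decomposition described in the abstract -- using the observed future of each expert trajectory to turn long-horizon policy learning into per-turn supervised tasks -- can be illustrated with a minimal sketch. The names below (`Turn`, `decompose_trajectory`, the `CONTINUE`/`STOP` labels) are illustrative assumptions, not the paper's actual implementation: the idea is that each prefix of an expert dialogue becomes a training example whose target is the expert's next question together with a stop/continue assessment revealed by where the trajectory actually ended.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    expert_question: str  # the question the expert asked at this turn
    user_reply: str       # the client's answer to it

def decompose_trajectory(turns):
    """Split one offline expert dialogue into per-turn supervised examples.

    Each example pairs the dialogue history (the observed past) with a
    target (action, state) tuple: `action` is the question the expert
    actually asked next, and `state` marks whether the expert kept
    gathering information (CONTINUE) or was at the final turn (STOP) --
    a dense label inferred from the trajectory's observed future.
    """
    examples = []
    history = []
    for i, turn in enumerate(turns):
        is_last = (i == len(turns) - 1)
        examples.append({
            "history": list(history),
            "target": {
                "action": turn.expert_question,
                "state": "STOP" if is_last else "CONTINUE",
            },
        })
        history.append((turn.expert_question, turn.user_reply))
    return examples
```

Framed this way, a policy model can be fine-tuned on `history -> target` pairs with ordinary supervised objectives, sidestepping any user simulator: the "when to stop" decision is learned from the same offline logs as the "what to ask" decision.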
Nov-10-2025