The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability

Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu

arXiv.org Artificial Intelligence 

An effective physician should combine empathy, expertise, patience, and clear communication when treating a patient. Recent advances have endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To address this gap, we present MAQuE, a benchmark for evaluating the inquiry capability of doctor agents. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across these evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.

Figure 1: Comparison between MAQuE and existing benchmarks. MAQuE enables more realistic patient simulation by integrating diverse behaviors and evaluates doctor inquiries from more comprehensive and fine-grained perspectives.

Medicine is among the most demanding professions to master. A physician's role extends far beyond treating diseases; it also involves employing nuanced conversational skills to understand a patient's condition and guide them through moments of vulnerability. Current Large Language Models (LLMs) have reached the initial stage of this journey by acquiring extensive medical knowledge and expertise in clinical examinations (Nori et al., 2023; Wang et al., 2023; Saab et al., 2024; Singhal et al., 2025; Dou et al., 2025).
However, their passive, response-driven nature (Li et al., 2024), an inherent tendency to answer user queries directly rather than to engage in goal-oriented dialogue, limits their practical utility. This shortcoming is particularly critical in clinical consultation, the focus of this work, where an LLM must proactively converse with patients to gather information through thoughtful and compassionate inquiry. Existing studies (Liao et al., 2023; Li et al., 2024; Schmidgall et al., 2024; Nori et al., 2025) have proposed several benchmarks to evaluate the inquiry capabilities of LLMs. A prevalent method is to develop a virtual interaction environment in which a patient is simulated by an LLM based on a synthesized profile.
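
To make the notion of a virtual interaction environment concrete, the following is a minimal sketch of such a consultation loop. It is not the implementation used by any of the cited benchmarks: the class names, the profile fields, and the function `run_consultation` are hypothetical, and simple rule-based stand-ins replace the LLM-driven patient and doctor so that the example runs without a model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PatientProfile:
    """Hypothetical synthesized profile; the field names are illustrative only."""
    diagnosis: str
    symptoms: dict[str, str]  # topic keyword -> what the patient says when asked


@dataclass
class SimulatedPatient:
    """Rule-based stand-in for an LLM patient: discloses only what is asked about."""
    profile: PatientProfile

    def reply(self, question: str) -> str:
        lowered = question.lower()
        for topic, answer in self.profile.symptoms.items():
            if topic in lowered:
                return answer
        return "I'm not sure."


@dataclass
class ScriptedDoctor:
    """Stand-in for the doctor agent under evaluation (assumed, not an LLM)."""
    questions: list[str]
    history: list[tuple[str, str]] = field(default_factory=list)

    def next_question(self) -> Optional[str]:
        asked = len(self.history)
        return self.questions[asked] if asked < len(self.questions) else None

    def diagnose(self) -> str:
        # Toy decision rule; a real agent would reason over the whole dialogue.
        text = " ".join(answer for _, answer in self.history).lower()
        return "influenza" if "fever" in text and "aches" in text else "unknown"


def run_consultation(doctor: ScriptedDoctor, patient: SimulatedPatient,
                     max_turns: int = 5) -> tuple[str, int]:
    """Alternate doctor questions and patient replies, then request a diagnosis."""
    for _ in range(max_turns):
        question = doctor.next_question()
        if question is None:
            break
        doctor.history.append((question, patient.reply(question)))
    return doctor.diagnose(), len(doctor.history)


profile = PatientProfile(
    diagnosis="influenza",
    symptoms={
        "fever": "Yes, I've had a fever since yesterday.",
        "pain": "My whole body aches.",
    },
)
patient = SimulatedPatient(profile)

# A doctor who asks targeted questions elicits the relevant findings.
thorough = ScriptedDoctor(["Do you have a fever?", "Any pain?"])
diagnosis, turns = run_consultation(thorough, patient)

# A doctor who asks an irrelevant question learns nothing from a passive patient.
careless = ScriptedDoctor(["How was your weekend?"])
missed, missed_turns = run_consultation(careless, patient)
```

In this sketch the patient never volunteers information, so the outcome depends entirely on what the doctor chooses to ask; this is the passive-disclosure behavior that makes proactive inquiry necessary. The turn count returned alongside the diagnosis is the kind of raw signal from which an efficiency measure could be derived, although the specific metrics are not defined here.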