Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models
Neural Information Processing Systems
Clinical reasoning in medicine is a hypothesis-driven process in which physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset comprises 1152 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. We evaluated several state-of-the-art LLMs and found that while models demonstrate competence in diagnosing conditions within well-described clinical presentations, their performance degrades significantly when required to navigate diagnostic uncertainty. Our analysis identified several failure modes that mirror common issues in clinical practice, including: (1) fixation on initial hypotheses, (2) excessive investigation ordering, (3) premature diagnostic closure, and (4) missing critical conditions. These patterns reveal fundamental limitations in how current LLMs manage uncertainty and gather information sequentially. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.
Jun-22-2026, 09:01:21 GMT
- Genre:
- Instructional Material (0.67)
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Industry:
- Health & Medicine
- Diagnostic Medicine > Imaging (1.00)
- Therapeutic Area
- Pulmonary/Respiratory Diseases (1.00)
- Neurology (1.00)
- Infections and Infectious Diseases (1.00)
- Hematology (1.00)
- Cardiology/Vascular Diseases (1.00)
- Gastroenterology (0.94)
- Endocrinology (0.68)
- Immunology (0.67)
- Technology: