MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

Grolleau, François, Alsentzer, Emily, Keyes, Timothy, Chung, Philip, Swaminathan, Akshay, Aali, Asad, Hom, Jason, Huynh, Tridu, Lew, Thomas, Liang, April S., Chu, Weihan, Steele, Natasha Z., Lin, Christina F., Yang, Jingkun, Black, Kameron C., Ma, Stephen P., Haredasht, Fateme N., Shah, Nigam H., Schulman, Kevin, Chen, Jonathan H.

arXiv.org Artificial Intelligence 

Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found