MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

Grolleau, François, Alsentzer, Emily, Keyes, Timothy, Chung, Philip, Swaminathan, Akshay, Aali, Asad, Hom, Jason, Huynh, Tridu, Lew, Thomas, Liang, April S., Chu, Weihan, Steele, Natasha Z., Lin, Christina F., Yang, Jingkun, Black, Kameron C., Ma, Stephen P., Haredasht, Fateme N., Shah, Nigam H., Schulman, Kevin, Chen, Jonathan H.

Sep-9-2025, arXiv.org Artificial Intelligence

Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury" (a multi-LLM majority vote) assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
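
As a rough illustration of the evaluation idea described in the abstract, the sketch below reduces per-fact votes from several judges to a majority decision and then measures agreement between two such decisions with Cohen's kappa. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the three-model jury, and the toy votes are all hypothetical, and the paper's actual prompts, models, and statistical procedure are not shown in this abstract.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal rate of True labels.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention chosen
        # for this sketch (both raters gave one identical constant label).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: one row per key fact, one vote per judge
# (True = the judge says the fact is present in the summary).
jury_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
panel_votes = [
    [True, True, True, True, False, False, True],
    [False, False, False, True, False, False, False],
    [True, True, True, True, True, True, True],
    [True, True, True, True, False, False, False],
]

jury_decisions = [majority_vote(v) for v in jury_votes]
panel_decisions = [majority_vote(v) for v in panel_votes]
kappa = cohen_kappa(jury_decisions, panel_decisions)
print(f"kappa = {kappa:.2f}")
```

An odd number of judges (three jurors, seven physicians in this toy setup) avoids ties, which is why the strict-majority rule above needs no tie-breaking logic.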

large language model, llm jury, machine learning, (20 more...)

Country:
- North America > United States > California > Santa Clara County (0.29)

Genre:
- Workflow (1.00)
- Research Report
  - Experimental Study (0.68)
  - New Finding (0.67)

Industry:
- Health & Medicine
  - Therapeutic Area (1.00)
  - Health Care Providers & Services (0.69)
  - Health Care Technology > Medical Record (0.30)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.34)
