A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool
Flanders, Adam E., Peng, Yifan, Prevedello, Luciano, Ball, Robyn, Colak, Errol, Menon, Prahlad, Shih, George, Lin, Hui-Ming, Lakhani, Paras
arXiv.org Artificial Intelligence
Purpose: To determine whether an ensemble of multiple LLM agents can collectively provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: A total of 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. The radiology reports were analyzed by an ensemble of eight open-source LLMs and a HIPAA-compliant internal version of GPT-4o, using a single multi-shot prompt that assessed for the presence of ICH. A total of 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and the consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC was achieved by Llama3.3:70b and GPT-4o (AUC = 0.78). Average precision was highest for Llama3.3:70b and GPT-4o (AP = 0.75 and 0.76, respectively). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). By MCC (95% CI), the ideal combinations of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT-4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium to large open-source LLMs provides a more consistent and reliable method than a single LLM alone for deriving a ground-truth retrospective evaluation of a clinical AI triage tool.
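As a rough illustration of the two ideas the abstract relies on, the sketch below forms a per-report consensus label from several binary LLM outputs and scores a set of labels with the Matthews correlation coefficient (MCC). Majority voting is an assumption here, since the abstract does not state the consensus rule, and the function names and toy labels are hypothetical rather than taken from the study.

```python
from math import sqrt


def majority_vote(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0.

    Assumption: simple majority voting; ties (even-sized ensembles) fall to 0.
    """
    return int(2 * sum(votes) > len(votes))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: three LLMs each label four reports for ICH (1 = present).
llm_votes = [
    [1, 1, 0],  # report 1
    [0, 0, 1],  # report 2
    [1, 1, 1],  # report 3
    [0, 0, 0],  # report 4
]
consensus = [majority_vote(v) for v in llm_votes]
reference = [1, 0, 1, 0]  # hypothetical manually reviewed labels
score = mcc(reference, consensus)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it a reasonable single summary when positive and negative cases are imbalanced.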
Oct-31-2025
- Country:
- Europe > Belgium
- Flanders (0.05)
- North America
- Canada > Ontario
- Toronto (0.04)
- United States
- California > San Francisco County
- San Francisco (0.15)
- Maine (0.04)
- New York > New York County
- New York City (0.04)
- Ohio > Franklin County
- Columbus (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.14)
- Washington > King County
- Redmond (0.04)
- Genre:
- Research Report
- Experimental Study > Negative Result (0.34)
- New Finding (1.00)
- Industry:
- Health & Medicine
- Diagnostic Medicine > Imaging (1.00)
- Health Care Providers & Services (1.00)
- Nuclear Medicine (1.00)
- Therapeutic Area (0.92)
- Technology: