A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool

Flanders, Adam E., Peng, Yifan, Prevedello, Luciano, Ball, Robyn, Colak, Errol, Menon, Prahlad, Shih, George, Lin, Hui-Ming, Lakhani, Paras

Oct-31-2025–arXiv.org Artificial Intelligence

Purpose: The purpose of this study was to determine if an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLM models and a HIPAA compliant internal version of GPT-4o using a single multi-shot prompt that assessed for presence of ICH. 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CTs exam-report pairs. The highest AUC performance was achieved with llama3.3:70b and GPT-4o (AUC= 0.78). The average precision was highest for Llama3.3:70b and GPT-4o (AP=0.75 & 0.76). Llama3.3:70b had the highest F1 score (0.81) and recall (0.85), greater precision (0.78), specificity (0.72), and MCC (0.57). Using MCC (95% CI) the ideal combination of LLMs were: Full-9 Ensemble 0.571 (0.552-0.591), Top-3 Ensemble 0.558 (0.537-0.579), Consensus 0.556 (0.539-0.574), and GPT4o 0.522 (0.500-0.543). No statistically significant differences were observed between Top-3, Full-9, and Consensus (p > 0.05). Conclusion: An ensemble of medium to large sized open-source LLMs provides a more consistent and reliable method to derive a ground truth retrospective evaluation of a clinical AI triage tool over a single LLM alone.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

Oct-31-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Pennsylvania (0.28)
  - California > San Francisco County
    - San Francisco (0.15)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study > Negative Result (0.34)

Industry:
- Health & Medicine
  - Nuclear Medicine (1.00)
  - Health Care Providers & Services (1.00)
  - Diagnostic Medicine > Imaging (1.00)
  - Therapeutic Area (0.92)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found