Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Sarvesh Soni, Dina Demner-Fushman
arXiv.org Artificial Intelligence
Automated approaches to answering patient-posed health questions are on the rise, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, limiting scalability. Automated metrics are promising, yet they are variably aligned with human judgments and often context-dependent. To assess the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2,800 in total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. With clinician-authored reference answers anchoring the metrics, automated rankings closely matched expert ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
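The core comparison described in the abstract, checking whether an automated metric ranks systems the same way experts do, is commonly done with a rank correlation such as Spearman's rho. The sketch below is illustrative only: the system scores are made up, and the paper's actual metrics and rating scales are not specified here.

```python
# Hypothetical sketch: compare system-level rankings from an automated
# metric against expert ratings using Spearman rank correlation.

def rank(values):
    """Return average ranks (1 = highest score), with ties averaged."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a group of tied scores
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative (made-up) mean scores for five AI systems.
expert = [4.2, 3.1, 4.8, 2.5, 3.9]       # expert ratings
metric = [0.71, 0.55, 0.83, 0.40, 0.68]  # reference-anchored automated metric

print(round(spearman(expert, metric), 3))  # → 1.0 (identical rankings)
```

A rho near 1 means the automated metric would pick the same winners as expert review; values near 0 indicate the metric is uninformative for system selection.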
October 2, 2025