Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Sarvesh Soni, Dina Demner-Fushman
arXiv.org Artificial Intelligence
Automated approaches to answering patient-posed health questions are on the rise, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, limiting scalability. Automated metrics are promising, yet they are variably aligned with human judgments and often context-dependent. To assess the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2,800 in total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. With clinician-authored reference answers anchoring the metrics, automated rankings closely matched expert ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
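The core comparison described in the abstract, checking whether an automated metric ranks systems the same way experts do, is commonly done with a rank correlation such as Spearman's rho. The sketch below is illustrative only: the system scores are made up, and the paper's actual metrics and rating scales are not specified here.

```python
# Hypothetical sketch: compare system-level rankings from an automated
# metric against expert ratings using Spearman rank correlation.

def rank(values):
    """Return average ranks (1 = highest score), with ties averaged."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a group of tied scores
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative (made-up) mean scores for five AI systems.
expert = [4.2, 3.1, 4.8, 2.5, 3.9]       # expert ratings
metric = [0.71, 0.55, 0.83, 0.40, 0.68]  # reference-anchored automated metric

print(round(spearman(expert, metric), 3))  # → 1.0 (identical rankings)
```

A rho near 1 means the automated metric would pick the same winners as expert review; values near 0 indicate the metric is uninformative for system selection.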
October 2, 2025