From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents

Lee, Gyubok, Chay, Woosog, Kwak, Heeyoung, Kim, Yeong Hwa, Yoo, Haanju, Jeong, Oksoon, Son, Meong Hi, Choi, Edward

Sep-30-2025–arXiv.org Artificial Intelligence

Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity arising from vague user questions, and value mismatch between user terminology and database entries. To address these challenges, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Experiments with state-of-the-art LLMs (e.g., o4-mini and Gemini-2.5-Flash) over five i.i.d. trials show that while agents achieve a high Pass@5 (success in at least one of five trials) of 90-95% on IncreQA and 60-80% on AdaptQA, their Pass^5 (success in all five trials) is substantially lower, by 35-60%. These results underscore the need to build agents that are not only performant but also robust for the safety-critical EHR domain. Finally, we provide diagnostic insights into common failure modes to guide future agent development.
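The gap between the two metrics is easier to see with a small worked example. The sketch below follows only the parenthetical definitions in the abstract (Pass@5: at least one of five trials succeeds; Pass^5: all five trials succeed) and computes both as simple empirical fractions over tasks. The task names, the trial outcomes, and the helper functions `pass_at_k`, `pass_hat_k`, and `summarize` are hypothetical and are not taken from the benchmark's code; the paper may compute these metrics with a different estimator.

```python
from typing import Dict, List


def pass_at_k(trials: List[bool]) -> bool:
    """True if at least one of the k trials succeeded (Pass@k for one task)."""
    return any(trials)


def pass_hat_k(trials: List[bool]) -> bool:
    """True only if every one of the k trials succeeded (Pass^k for one task)."""
    return all(trials)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Average the per-task indicators over all tasks."""
    n_tasks = len(results)
    return {
        "pass@k": sum(pass_at_k(t) for t in results.values()) / n_tasks,
        "pass^k": sum(pass_hat_k(t) for t in results.values()) / n_tasks,
    }


# Hypothetical outcomes: four tasks, five independent trials each.
results = {
    "task_a": [True, True, True, True, True],       # always solved
    "task_b": [True, False, True, True, False],     # solved inconsistently
    "task_c": [False, False, True, False, False],   # solved once
    "task_d": [False, False, False, False, False],  # never solved
}

scores = summarize(results)
print(scores)  # {'pass@k': 0.75, 'pass^k': 0.25}
```

In this made-up data, three of the four tasks are solved at least once, but only one is solved every time, so an agent can look strong on Pass@5 while being unreliable by Pass^5.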

large language model, machine learning, natural language

Country:
- North America > United States (1.00)
- Asia (0.92)
- Europe (0.67)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (0.87)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Health Care Technology > Medical Record (1.00)
  - Health Care Providers & Services (1.00)
  - Diagnostic Medicine (1.00)
  - Government Relations & Public Policy (0.93)
  - Therapeutic Area
    - Neurology > Epilepsy (1.00)
    - Genetic Disease (1.00)
    - Cardiology/Vascular Diseases (1.00)
- Government > Regional Government
  - North America Government > United States Government > FDA (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
