LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints

Bologna, Federica, Pan, Tiffany, Wilkens, Matthew, Guo, Yue, Wang, Lucy Lu

Oct-14-2025–arXiv.org Artificial Intelligence

Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

Oct-14-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.68)
- Europe (0.67)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (0.68)
  - Therapeutic Area
    - Infections and Infectious Diseases (1.00)
    - Obstetrics/Gynecology (0.94)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.72)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found