Automatic Evaluation of Healthcare LLMs Beyond Question-Answering

Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla

arXiv.org Artificial Intelligence 

Current Large Language Model (LLM) benchmarks are often based on open-ended or closed-ended QA evaluations that avoid the need for human labor. Closed-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended ones capture the model's capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work focuses on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open-ended and closed-ended benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark, CareQA, with both open-ended and closed-ended variants. Finally, we propose a novel metric for open-ended evaluations, Relaxed Perplexity, to mitigate the identified limitations.
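The abstract mentions exploring correlations between open-ended and closed-ended benchmarks. As a minimal, hedged sketch of what such a correlation analysis could look like (this is a generic rank-correlation illustration, not the paper's exact protocol, and all scores below are made-up placeholder values):

```python
# Illustrative sketch only: rank-correlate hypothetical per-model scores from a
# closed-ended benchmark (e.g. QA accuracy) with scores from an open-ended
# metric (e.g. a summarization or perplexity-based score).
from scipy.stats import spearmanr

# Hypothetical scores for the same set of models on two evaluation axes.
closed_ended_accuracy = [0.62, 0.71, 0.55, 0.80, 0.67]  # closed-ended QA accuracy per model
open_ended_score = [0.41, 0.48, 0.39, 0.52, 0.44]        # open-ended metric per model

# Spearman's rho measures agreement between the two model rankings.
rho, p_value = spearmanr(closed_ended_accuracy, open_ended_score)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A high rank correlation would suggest the two evaluation styles rank models similarly; a low one would point to the kind of blind spots and divergences the paper investigates.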
