MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework

Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu

arXiv.org Artificial Intelligence 

Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these skills comprehensively. To address this gap, we introduce MedQA-CS, an AI Structured Clinical Examination (AI-SCE) framework inspired by the Objective Structured Clinical Examinations (OSCEs) used in medical education. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include the MedQA-CS evaluation framework itself, with publicly available data and expert annotations, and a quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of the clinical capabilities of both open- and closed-source LLMs.
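
The following is a minimal illustrative sketch, not the authors' released code, of how the two instruction-following stages described above could be wired together: a candidate "medical student" LLM responds to an OSCE-style case, and a separate "CS examiner" LLM grades that response against an expert checklist. The names `OSCECase`, `call_llm`, `run_medical_student`, `run_cs_examiner`, and the prompt wording are all assumptions introduced here for illustration; MedQA-CS's actual data fields, prompts, and scoring rubric may differ.

```python
from dataclasses import dataclass


def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stub standing in for a real chat-completion API call; replace with your provider."""
    # Hypothetical placeholder: MedQA-CS itself does not define this function.
    return "(model output would appear here)"


@dataclass
class OSCECase:
    case_id: str
    patient_script: str   # standardized-patient background shown to the student LLM
    checklist: list[str]  # expert-written items the examiner LLM scores against


def run_medical_student(case: OSCECase) -> str:
    """Stage 1: the candidate LLM acts as a medical student on an OSCE station."""
    system = "You are a medical student in an OSCE station. Follow the instructions exactly."
    user = f"Patient encounter:\n{case.patient_script}\n\nDocument your clinical reasoning."
    return call_llm(system, user)


def run_cs_examiner(case: OSCECase, student_response: str) -> str:
    """Stage 2: a judge LLM grades the student response item by item."""
    items = "\n".join(f"- {item}" for item in case.checklist)
    system = "You are an OSCE examiner. Grade strictly against the checklist."
    user = (
        f"Checklist:\n{items}\n\n"
        f"Student response:\n{student_response}\n\n"
        "For each item, answer 'met' or 'not met' and give a one-sentence justification."
    )
    return call_llm(system, user)


def evaluate(cases: list[OSCECase]) -> dict[str, str]:
    """Run both stages over a benchmark split and collect examiner feedback per case."""
    results = {}
    for case in cases:
        response = run_medical_student(case)
        results[case.case_id] = run_cs_examiner(case, response)
    return results


if __name__ == "__main__":
    # Toy example with an invented case; real MedQA-CS cases carry expert annotations.
    demo = OSCECase(
        case_id="station-01",
        patient_script="45-year-old with acute chest pain radiating to the left arm.",
        checklist=[
            "Asks about onset and character of pain",
            "Considers acute coronary syndrome",
        ],
    )
    print(evaluate([demo])["station-01"])
```

Under these assumptions, the key design point is the separation of roles: the student model never sees the checklist, while the examiner model scores only against it, mirroring how OSCE graders assess trainees against predefined criteria.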