MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering
Hao, Yuexing, Alhamoud, Kumail, Jeong, Hyewon, Zhang, Haoran, Puri, Isha, Torr, Philip, Schaekermann, Mike, Stern, Ariel D., Ghassemi, Marzyeh
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering medical questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those of LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: http://medpair.csail.mit.edu/.
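The sentence-level comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `agreement_rate` and `keep_relevant`, the boolean label encoding, and the toy question are all assumptions made for illustration, and simple percent agreement stands in for whatever alignment measure the paper actually uses.

```python
from typing import List


def agreement_rate(physician: List[bool], model: List[bool]) -> float:
    """Fraction of sentences on which both label sources agree.

    Hypothetical helper: True means "relevant", False means "irrelevant".
    """
    if len(physician) != len(model):
        raise ValueError("label lists must have the same length")
    if not physician:
        raise ValueError("label lists must not be empty")
    matches = sum(p == m for p, m in zip(physician, model))
    return matches / len(physician)


def keep_relevant(sentences: List[str], labels: List[bool]) -> str:
    """Rebuild a question from only the sentences labeled relevant."""
    if len(sentences) != len(labels):
        raise ValueError("need exactly one label per sentence")
    return " ".join(s for s, keep in zip(sentences, labels) if keep)


# Invented toy example, not taken from the MedPAIR dataset.
sentences = [
    "A 54-year-old man presents with chest pain.",
    "He recently returned from a holiday.",
    "The pain radiates to his left arm.",
    "He works as an accountant.",
]
physician_labels = [True, False, True, False]  # assumed trainee labels
model_labels = [True, True, True, False]       # assumed LLM labels

rate = agreement_rate(physician_labels, model_labels)
filtered = keep_relevant(sentences, physician_labels)
print(f"agreement: {rate:.2f}")
print(filtered)
```

Here the two label sources disagree on one of the four sentences, and the filtered question keeps only the two sentences the hypothetical trainee marked relevant. That filtered form corresponds to the "relevant subset" whose effect on downstream accuracy the study evaluates.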
Jun-2-2025
- Country:
- Africa > Chad
- Salamat (0.04)
- Asia
- China > Hong Kong (0.04)
- Indonesia > Bali (0.04)
- Malaysia (0.04)
- Middle East
- Jordan (0.04)
- Saudi Arabia > Asir Province
- Abha (0.04)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany > Brandenburg
- Potsdam (0.04)
- Italy (0.04)
- Middle East > Malta (0.04)
- Monaco (0.04)
- Switzerland (0.04)
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- United States
- Alabama (0.04)
- California > San Diego County
- San Diego (0.04)
- Indiana (0.04)
- Maryland (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.34)
- New Mexico > Bernalillo County
- Albuquerque (0.04)
- New York > Orange County
- Middletown (0.04)
- Texas (0.04)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Industry:
- Health & Medicine
- Diagnostic Medicine (0.93)
- Therapeutic Area (1.00)
- Technology: