The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
Abinitha Gourabathina, Yuexing Hao, Walter Gerych, Marzyeh Ghassemi
arXiv.org Artificial Intelligence
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
Jun-23-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe
- Croatia > Split-Dalmatia County
- Split (0.04)
- France (0.04)
- North America > United States
- Alaska (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- New Mexico > Bernalillo County
- Albuquerque (0.04)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Strength High (0.93)
- Industry:
- Education > Health & Safety
- School Nutrition (0.67)
- Health & Medicine
- Consumer Health (1.00)
- Diagnostic Medicine (1.00)
- Pharmaceuticals & Biotechnology (1.00)
- Therapeutic Area
- Infections and Infectious Diseases (0.92)
- Oncology (1.00)
- Psychiatry/Psychology (0.92)
- Information Technology > Security & Privacy (0.92)
- Technology: