HealthBench: Evaluating Large Language Models Towards Improved Human Health
Arora, Rahul K., Wei, Jason, Hicks, Rebecca Soskin, Bowman, Preston, Quiñonero-Candela, Joaquin, Tsimpourlas, Foivos, Sharman, Michael, Shah, Meghan, Vallone, Andrea, Beutel, Alex, Heidecke, Johannes, Singhal, Karan
–arXiv.org Artificial Intelligence
HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
May-14-2025
- Country:
- Africa
- Cameroon (0.04)
- Ethiopia > Addis Ababa
- Addis Ababa (0.04)
- Kenya
- Kisumu County > Kisumu (0.04)
- Nairobi City County > Nairobi (0.04)
- Nairobi Province (0.04)
- Narok County > Narok (0.04)
- Madagascar (0.04)
- Middle East
- Egypt (0.04)
- Tunisia > Tunis Governorate
- Tunis (0.04)
- Nigeria > Lagos State (0.04)
- Asia
- Bangladesh (0.04)
- Bhutan (0.04)
- India
- Maharashtra > Pune (0.04)
- NCT > Delhi (0.04)
- Tamil Nadu (0.04)
- Middle East
- Kuwait > Farwaniya Governorate
- Farwaniya (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Nepal > Bagmati Province
- Kathmandu District > Kathmandu (0.04)
- Pakistan > Sindh
- Karachi Division > Karachi (0.04)
- Uzbekistan (0.04)
- Europe
- Bosnia and Herzegovina > Federation of Bosnia and Herzegovina
- Sarajevo Canton > Sarajevo (0.04)
- Czechia > South Moravian Region
- Brno (0.04)
- Italy > Piedmont
- Turin Province > Turin (0.04)
- Jersey (0.14)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Ukraine > Kharkiv Oblast
- Kharkiv (0.04)
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- North America
- Canada
- Alberta > Census Division No. 6
- Calgary Metropolitan Region > Calgary (0.04)
- Ontario > Toronto (0.14)
- Quebec > Montreal (0.04)
- United States
- California
- Los Angeles County > Los Angeles (0.04)
- San Francisco County > San Francisco (0.14)
- Santa Clara County > Palo Alto (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.14)
- Wisconsin (0.04)
- Oceania > Australia
- South America > Brazil
- Santa Catarina (0.04)
- Genre:
- Research Report
- Experimental Study (1.00)
- Strength High (0.92)
- Industry:
- Health & Medicine
- Consumer Health (1.00)
- Diagnostic Medicine (0.87)
- Health Care Providers & Services (1.00)
- Nuclear Medicine (0.67)
- Surgery (1.00)
- Therapeutic Area
- Neurology (1.00)
- Oncology (0.67)
- Psychiatry/Psychology (1.00)
- Technology: