Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations
Wei, Kevin L.; Paskov, Patricia; Dev, Sunishchal; Byun, Michael J.; Reuel, Anka; Roberts-Gaal, Xavier; Calcott, Rachel; Coxon, Evie; Deshpande, Chinmay
arXiv.org Artificial Intelligence
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines
Nov-4-2025
- Country:
  - Africa > Eswatini
  - Asia
    - Indonesia > Bali (0.04)
    - Japan > Honshū
      - Chūbu > Toyama Prefecture
        - Toyama (0.04)
      - Tōhoku > Iwate Prefecture
        - Morioka (0.04)
    - Middle East > UAE
      - Abu Dhabi Emirate > Abu Dhabi (0.04)
    - Singapore (0.04)
    - Thailand > Bangkok
      - Bangkok (0.04)
  - Europe
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Middle East > Cyprus
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
      - Oxfordshire > Oxford (0.04)
    - Monaco (0.04)
    - Switzerland (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - Germany > Saxony
      - Leipzig (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Austria > Vienna (0.14)
    - Sweden > Stockholm
      - Stockholm (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - Dominican Republic (0.04)
    - Mexico > Mexico City
      - Mexico City (0.04)
    - United States
      - California
        - Los Angeles County > Santa Monica (0.04)
        - Santa Clara County > Stanford (0.04)
      - District of Columbia > Washington (0.04)
      - Florida > Miami-Dade County
        - Miami (0.04)
      - Massachusetts > Middlesex County
        - Cambridge (0.14)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - New Mexico > Bernalillo County
        - Albuquerque (0.04)
      - Washington > King County
        - Seattle (0.04)
      - Wisconsin (0.04)
  - South America > Colombia
    - Meta Department > Villavicencio (0.04)
- Genre:
  - Instructional Material (0.92)
  - Overview (0.93)
  - Questionnaire & Opinion Survey (1.00)
  - Research Report
    - Experimental Study (1.00)
    - New Finding (0.92)
- Industry:
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Applied AI (0.92)
      - Cognitive Science (0.92)
      - Issues > Social & Ethical Issues (0.93)
      - Machine Learning > Neural Networks
        - Deep Learning (1.00)
      - Natural Language
        - Chatbot (1.00)
        - Large Language Model (1.00)
      - Representation & Reasoning > Commonsense Reasoning (0.67)
    - Communications > Social Media
      - Crowdsourcing (0.67)
    - Data Science > Data Mining (0.92)