Benchmarks as Microscopes: A Call for Model Metrology
Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra
arXiv.org Artificial Intelligence
Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs require a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and add clarity to the AI discussion.
Jul-30-2024
- Country:
  - Asia
    - Indonesia > Bali (0.04)
    - Middle East
      - Jordan (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Singapore (0.04)
  - Europe
    - Austria (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - United Kingdom > England
      - Oxfordshire > Oxford (0.04)
  - North America
    - Canada > British Columbia
    - Mexico > Mexico City
      - Mexico City (0.04)
    - United States
      - California > Santa Barbara County
        - Santa Barbara (0.04)
      - Illinois > Cook County
        - Chicago (0.04)
      - Kansas > Sheridan County (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
  - Oceania > Nauru
    - Aiwo Constituency > Aiwo District (0.04)
- Genre:
  - Research Report (0.83)
- Industry:
  - Education (0.67)
  - Health & Medicine (0.67)
  - Information Technology > Security & Privacy (0.67)
  - Transportation (0.46)
- Technology: