Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda
arXiv.org Artificial Intelligence
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
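The dense, multi-metric benchmarking the abstract describes can be pictured as filling a models × scenarios × metrics grid, where coverage is the fraction of cells actually evaluated (HELM's 96.0% versus 17.9% before). The sketch below is purely illustrative: the model and scenario names and the `evaluate` function are hypothetical stand-ins, not HELM's actual API or toolkit.

```python
# Hypothetical sketch of a dense multi-metric evaluation grid in the
# spirit of HELM: every model is scored on every (scenario, metric)
# pair so that results are directly comparable. All names below are
# illustrative, not HELM's real interface.
from itertools import product

MODELS = ["model-a", "model-b", "model-c"]
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3"]  # stand-ins for the 16 core scenarios
METRICS = ["accuracy", "calibration", "robustness",
           "fairness", "bias", "toxicity", "efficiency"]

def evaluate(model: str, scenario: str, metric: str) -> float:
    """Placeholder scorer; a real harness would run prompts and score completions."""
    return 0.0

# Dense benchmarking: one result per (model, scenario, metric) cell.
results = {
    (m, s, x): evaluate(m, s, x)
    for m, s, x in product(MODELS, SCENARIOS, METRICS)
}

# Coverage = fraction of cells actually filled; HELM reports improving
# this from 17.9% to 96.0% across 30 models on the core scenarios.
coverage = sum(v is not None for v in results.values()) / len(results)
print(f"{coverage:.1%}")
```

Under these assumptions the grid is fully populated, so coverage is 100%; in practice some (scenario, metric) cells are infeasible, which is why HELM reports 87.5% of core metrics being measurable and 96.0% dense coverage rather than 100%.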
Oct-1-2023