Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda
arXiv.org Artificial Intelligence
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
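The dense, multi-metric benchmarking the abstract describes can be pictured as filling a models × scenarios × metrics grid, where coverage is the fraction of cells actually evaluated (HELM's 96.0% versus 17.9% before). The sketch below is purely illustrative: the model and scenario names and the `evaluate` function are hypothetical stand-ins, not HELM's actual API or toolkit.

```python
# Hypothetical sketch of a dense multi-metric evaluation grid in the
# spirit of HELM: every model is scored on every (scenario, metric)
# pair so that results are directly comparable. All names below are
# illustrative, not HELM's real interface.
from itertools import product

MODELS = ["model-a", "model-b", "model-c"]
SCENARIOS = ["scenario-1", "scenario-2", "scenario-3"]  # stand-ins for the 16 core scenarios
METRICS = ["accuracy", "calibration", "robustness",
           "fairness", "bias", "toxicity", "efficiency"]

def evaluate(model: str, scenario: str, metric: str) -> float:
    """Placeholder scorer; a real harness would run prompts and score completions."""
    return 0.0

# Dense benchmarking: one result per (model, scenario, metric) cell.
results = {
    (m, s, x): evaluate(m, s, x)
    for m, s, x in product(MODELS, SCENARIOS, METRICS)
}

# Coverage = fraction of cells actually filled; HELM reports improving
# this from 17.9% to 96.0% across 30 models on the core scenarios.
coverage = sum(v is not None for v in results.values()) / len(results)
print(f"{coverage:.1%}")
```

Under these assumptions the grid is fully populated, so coverage is 100%; in practice some (scenario, metric) cells are infeasible, which is why HELM reports 87.5% of core metrics being measurable and 96.0% dense coverage rather than 100%.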
Oct-1-2023