Holmes: Benchmark the Linguistic Competence of Language Models

Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych

May-22-2024–arXiv.org Artificial Intelligence

We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs), that is, their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses linguistic competence through LMs' internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., the part of speech of words) from other cognitive abilities, such as following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. To compose Holmes, we review over 250 probing studies and feature more than 200 datasets that assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, in line with known trends, their linguistic competence correlates with model size. Surprisingly, however, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to reduce the high computational load while maintaining high ranking precision.
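Classifier-based probing, as mentioned in the abstract, can be illustrated with a minimal sketch: a small classifier is trained on frozen representations to predict a linguistic label, and its held-out accuracy is read as evidence of how much of that phenomenon the representations encode. The code below is not the Holmes implementation. It assumes synthetic vectors as a stand-in for LM representations, and all function names (`make_synthetic_representations`, `train_linear_probe`, `evaluate_probe`) are hypothetical.

```python
import numpy as np


def make_synthetic_representations(n, d, num_classes, rng, signal=2.0):
    """Hypothetical stand-in for frozen LM representations with labels.

    Each class (e.g., a part-of-speech tag) gets a random center; samples
    are that center plus unit Gaussian noise.
    """
    centers = rng.normal(size=(num_classes, d)) * signal
    y = rng.integers(0, num_classes, size=n)
    X = centers[y] + rng.normal(size=(n, d))
    return X, y


def train_linear_probe(X, y, num_classes, lr=0.1, epochs=300):
    """Fit a softmax linear probe with full-batch gradient descent.

    Only the probe weights are trained; X is treated as fixed input.
    """
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def evaluate_probe(X, y, num_classes, train_size):
    """Train the probe on the first train_size rows, report held-out accuracy."""
    X_train, y_train = X[:train_size], y[:train_size]
    X_test, y_test = X[train_size:], y[train_size:]
    # Standardize with training statistics only.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std
    W, b = train_linear_probe(X_train, y_train, num_classes)
    predictions = (X_test @ W + b).argmax(axis=1)
    return float((predictions == y_test).mean())


rng = np.random.default_rng(0)
X, y = make_synthetic_representations(n=1200, d=16, num_classes=3, rng=rng)
accuracy = evaluate_probe(X, y, num_classes=3, train_size=1000)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

In an actual probing setup, `X` would be replaced by representations extracted from a fixed LM and `y` by labels from a probing dataset; the probe is kept deliberately simple so that high accuracy is attributable to the representations rather than to the classifier.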

computational linguistic, linguistic, proceedings

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Texas > Travis County
      - Austin (0.04)
    - Pennsylvania > Philadelphia County
      - Philadelphia (0.04)
    - Ohio > Franklin County
      - Columbus (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Louisiana > Orleans Parish
      - New Orleans (0.05)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
    - California > San Diego County
      - San Diego (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - Newfoundland and Labrador > Labrador (0.04)
    - Quebec > Montreal (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Austria (0.04)
  - Slovenia (0.04)
  - France (0.04)
  - Netherlands > North Holland
    - Amsterdam (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)
  - Germany
    - Berlin (0.04)
    - Hesse > Darmstadt Region
      - Darmstadt (0.04)
  - Sweden > Uppsala County
    - Uppsala (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Italy
    - Tuscany > Florence (0.04)
    - Trentino-Alto Adige/Südtirol > Trentino Province
      - Trento (0.04)
  - Middle East
    - Republic of Türkiye > Istanbul Province
      - Istanbul (0.04)
    - Malta > Eastern Region
      - Northern Harbour District > St. Julian's (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
  - Denmark > North Jutland
    - Aalborg (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Singapore (0.04)
  - Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)
    - Republic of Türkiye > Istanbul Province
      - Istanbul (0.04)
    - Qatar > Ad-Dawhah
      - Doha (0.04)
- Africa
  - Rwanda > Kigali
    - Kigali (0.04)
  - Ethiopia > Addis Ababa
    - Addis Ababa (0.04)

Genre:
- Research Report > Experimental Study (0.46)

Industry:
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language
    - Text Processing (1.00)
    - Large Language Model (1.00)
    - Chatbot (0.95)
    - Grammars & Parsing (0.93)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
