Holmes: Benchmark the Linguistic Competence of Language Models
Waldis, Andreas, Perlitz, Yotam, Choshen, Leshem, Hou, Yufang, Gurevych, Iryna
arXiv.org Artificial Intelligence
We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs), that is, their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses the linguistic competence of LMs via their internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., part-of-speech of words) from other cognitive abilities, like following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. To compose Holmes, we review over 250 probing studies and feature more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, in line with known trends, their linguistic competence correlates with model size. Surprisingly, however, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to lower the high computation load while maintaining high ranking precision.
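The classifier-based probing mentioned in the abstract can be illustrated with a minimal sketch. It is not the Holmes implementation: the vectors below are synthetic stand-ins for frozen LM representations, the binary label stands in for a linguistic property, and all function names (`make_synthetic_representations`, `train_linear_probe`, `probe_accuracy`, `run_probe_demo`) are hypothetical. The idea shown is only that a small classifier is trained on fixed representations and that its held-out accuracy is read as evidence of how much of the property is encoded.

```python
import math
import random


def make_synthetic_representations(n, dim, seed):
    """Assumed stand-in for frozen LM embeddings.

    Each example is (vector, label); the label shifts the first coordinate,
    so the property is linearly recoverable from the vector.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if label == 1 else -3.0
        data.append((vec, label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(examples, dim, epochs=20, lr=0.1):
    """Fit a logistic-regression probe with plain SGD.

    Only the probe's weights are updated; the representations stay fixed.
    """
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, label in examples:
            z = sum(w * x for w, x in zip(weights, vec)) + bias
            err = sigmoid(z) - label
            weights = [w - lr * err * x for w, x in zip(weights, vec)]
            bias -= lr * err
    return weights, bias


def probe_accuracy(weights, bias, examples):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for vec, label in examples:
        z = sum(w * x for w, x in zip(weights, vec)) + bias
        correct += int((z > 0) == (label == 1))
    return correct / len(examples)


def run_probe_demo(dim=8):
    """Train on one synthetic split and evaluate on a held-out split."""
    train = make_synthetic_representations(300, dim, seed=0)
    test = make_synthetic_representations(100, dim, seed=1)
    weights, bias = train_linear_probe(train, dim)
    return probe_accuracy(weights, bias, test)


if __name__ == "__main__":
    print(f"held-out probe accuracy: {run_probe_demo():.2f}")
```

In a real probing setup the synthetic vectors would be replaced by representations extracted from an LM, and the labels by annotations for a specific phenomenon such as part-of-speech tags.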
May-22-2024
- Country:
- Oceania > Australia
- North America
- Dominican Republic (0.04)
- United States
- Washington > King County
- Seattle (0.04)
- Texas > Travis County
- Austin (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Ohio > Franklin County
- Columbus (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Louisiana > Orleans Parish
- New Orleans (0.05)
- Hawaii > Honolulu County
- Honolulu (0.04)
- California > San Diego County
- San Diego (0.04)
- Canada
- Ontario > Toronto (0.04)
- Newfoundland and Labrador > Labrador (0.04)
- Quebec > Montreal (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Europe
- Austria (0.04)
- Slovenia (0.04)
- France (0.04)
- Netherlands > North Holland
- Amsterdam (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Germany
- Berlin (0.04)
- Hesse > Darmstadt Region
- Darmstadt (0.04)
- Sweden > Uppsala County
- Uppsala (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Italy
- Tuscany > Florence (0.04)
- Trentino-Alto Adige/Südtirol > Trentino Province
- Trento (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Denmark > North Jutland
- Aalborg (0.04)
- Asia
- China > Hong Kong (0.04)
- Singapore (0.04)
- Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Qatar > Ad-Dawhah
- Doha (0.04)
- Africa
- Rwanda > Kigali
- Kigali (0.04)
- Ethiopia > Addis Ababa
- Addis Ababa (0.04)
- Genre:
- Research Report > Experimental Study (0.46)
- Industry:
- Education (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Representation & Reasoning (1.00)
- Natural Language
- Text Processing (1.00)
- Large Language Model (1.00)
- Chatbot (0.95)
- Grammars & Parsing (0.93)
- Machine Learning > Neural Networks
- Deep Learning (1.00)