Holmes: Benchmark the Linguistic Competence of Language Models
Waldis, Andreas, Perlitz, Yotam, Choshen, Leshem, Hou, Yufang, Gurevych, Iryna
arXiv.org Artificial Intelligence
We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs), that is, their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses the linguistic competence of LMs via their internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., part-of-speech of words) from other cognitive abilities, such as following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. To compose Holmes, we review over 250 probing studies and feature more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, in line with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to reduce the high computational load while maintaining high ranking precision.
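To make "classifier-based probing" concrete, here is a minimal sketch of the general idea: a small linear classifier is trained on frozen vectors, and its held-out accuracy indicates how easily a phenomenon can be read off those vectors. This is not the Holmes implementation. The vectors are synthetic stand-ins for LM representations, the binary label is a hypothetical stand-in for a linguistic property, and all function names are illustrative assumptions.

```python
import math
import random


def make_toy_representations(n, dim, rng):
    """Synthetic stand-in for frozen LM vectors (assumption, not real LM output).

    The binary label is encoded along dimension 0; all dimensions carry noise.
    """
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 3.0 if y == 1 else -3.0
        features.append(x)
        labels.append(y)
    return features, labels


def train_linear_probe(features, labels, epochs=20, lr=0.1):
    """Fit a binary logistic-regression probe with plain SGD.

    Only the probe's weights are updated; the representations stay fixed.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() stable
            err = 1.0 / (1.0 + math.exp(-z)) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of held-out examples the probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_toy_representations(200, 8, rng)
test_x, test_y = make_toy_representations(100, 8, rng)
weights, bias = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(weights, bias, test_x, test_y)
print(f"probe accuracy: {accuracy:.2f}")
```

In an actual probing setup, the synthetic vectors would be replaced by representations extracted from a frozen LM, and the labels by annotations from a probing dataset.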
May-22-2024
- Genre:
- Research Report > Experimental Study (0.46)
- Industry:
- Education (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (1.00)
- Natural Language
- Chatbot (0.95)
- Grammars & Parsing (0.93)
- Large Language Model (1.00)
- Text Processing (1.00)
- Representation & Reasoning (1.00)