AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Zbeeb, Mohammad, Hammoud, Hasan Abed Al Kader, Mukalled, Sina, Rizk, Nadine, Karnib, Fatima, Lakkis, Issam, Mohanna, Ammar, Ghanem, Bernard
–arXiv.org Artificial Intelligence
The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
arXiv.org Artificial Intelligence
Dec-10-2025
- Country:
- Asia
- China (0.04)
- Middle East
- Jordan (0.04)
- Lebanon > Beirut Governorate
- Beirut (0.05)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.14)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Austria > Vienna (0.14)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Switzerland > Basel-City
- Basel (0.04)
- Ukraine > Kyiv Oblast
- Kyiv (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- Canada > Ontario
- Oceania > Tonga (0.04)
- Asia
- Genre:
- Questionnaire & Opinion Survey (0.34)
- Research Report (0.51)
- Industry:
- Education > Assessment & Standards > Student Performance (0.35)
- Technology: