Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices
Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić
arXiv.org Artificial Intelligence
While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
November 5, 2025