NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

Vladislav Mikhailov, Tita Enstad, David Samuel, Hans Christian Farsethås, Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

Jun-6-2025–arXiv.org Artificial Intelligence

This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets, five of which are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and covers both official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-written prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.



