Evalita-LLM: Benchmarking Large Language Models on Italian
Bernardo Magnini, Roberto Zanoli, Michele Resta, Martin Cimmino, Paolo Albano, Marco Madeddu, Viviana Patti
–arXiv.org Artificial Intelligence
We describe Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding translation into Italian and the cultural biases it can introduce; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, thereby mitigating model sensitivity to specific prompts and allowing a fairer and more objective evaluation. We propose an iterative methodology in which candidate tasks and candidate prompts are validated against a set of LLMs used for development. We report experimental results from the benchmark's development phase and provide performance statistics for several state-of-the-art LLMs.
Feb-4-2025
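To illustrate point (iii) of the abstract, a minimal sketch of multi-prompt evaluation: the same task is scored under several prompt templates and the per-prompt scores are aggregated, so the reported figure depends less on any single prompt phrasing. The `model.generate` interface, the prompt templates, and the `score_prompt`/`evaluate_task` helpers below are illustrative assumptions, not part of Evalita-LLM.

```python
# Illustrative sketch of multi-prompt evaluation (hypothetical interfaces,
# not taken from the Evalita-LLM codebase).
from statistics import mean, stdev

def score_prompt(model, template, examples):
    """Accuracy of `model` on `examples` when queried with one prompt template."""
    correct = 0
    for ex in examples:
        prediction = model.generate(template.format(text=ex["text"]))  # assumed model API
        correct += int(prediction.strip().lower() == ex["label"].lower())
    return correct / len(examples)

def evaluate_task(model, templates, examples):
    """Score the same task under several prompt phrasings and aggregate,
    so the reported number is less sensitive to any single prompt choice."""
    scores = [score_prompt(model, t, examples) for t in templates]
    return {
        "mean": mean(scores),
        "std": stdev(scores) if len(scores) > 1 else 0.0,
        "per_prompt": scores,
    }

# Example (hypothetical) prompt variants for an Italian sentiment task:
templates = [
    "Classifica il sentimento del seguente testo come positivo o negativo: {text}",
    "Testo: {text}\nIl sentimento è positivo o negativo?",
    "Leggi il testo e rispondi 'positivo' oppure 'negativo'.\n{text}",
]
```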