Challenging the Abilities of Large Language Models in Italian: a Community Initiative
Nissim, Malvina, Croce, Danilo, Patti, Viviana, Basile, Pierpaolo, Attanasio, Giuseppe, Musacchio, Elio, Rinaldi, Matteo, Borazio, Federico, Francis, Maria, Gili, Jacopo, Scalena, Daniel, Altuna, Begoña, Azurmendi, Ekhi, Basile, Valerio, Bentivogli, Luisa, Bisazza, Arianna, Bolognesi, Marianna, Brunato, Dominique, Caselli, Tommaso, Casola, Silvia, Cassese, Maria, Cettolo, Mauro, Collacciani, Claudia, De Cosmo, Leonardo, Di Buono, Maria Pia, Esuli, Andrea, Etxaniz, Julen, Ferrando, Chiara, Fidelangeli, Alessia, Frenda, Simona, Fusco, Achille, Gaido, Marco, Galassi, Andrea, Galli, Federico, Giordano, Luca, Goffetti, Mattia, Gonzalez-Dios, Itziar, Gregori, Lorenzo, Grundler, Giulia, Iannaccone, Sandro, Jiang, Chunyang, La Quatra, Moreno, Lagioia, Francesca, Lo, Soda Marem, Madeddu, Marco, Magnini, Bernardo, Manna, Raffaele, Mercorio, Fabio, Merlo, Paola, Muti, Arianna, Nastase, Vivi, Negri, Matteo, Onorati, Dario, Palmieri, Elena, Papi, Sara, Passaro, Lucia, Pensa, Giulia, Piergentili, Andrea, Potertì, Daniele, Puccetti, Giovanni, Ranaldi, Federico, Ranaldi, Leonardo, Ravelli, Andrea Amelio, Rosola, Martina, Ruzzetti, Elena Sofia, Samo, Giuseppe, Santilli, Andrea, Santin, Piera, Sarti, Gabriele, Sartor, Giovanni, Savoldi, Beatrice, Serino, Antonio, Seveso, Andrea, Siciliani, Lucia, Torroni, Paolo, Varvara, Rossella, Zaninello, Andrea, Zanollo, Asya, Zanzotto, Fabio Massimo, Zeinalipour, Kamyar, Zugarini, Andrea
–arXiv.org Artificial Intelligence
The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
arXiv.org Artificial Intelligence
Dec-5-2025
- Country:
- Africa > Comoros
- Grande Comore > Moroni (0.04)
- Asia
- China > Hong Kong (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.14)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.05)
- Europe
- Spain > Basque Country (0.04)
- Portugal
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Italy
- Apulia > Bari (0.04)
- Emilia-Romagna > Metropolitan City of Bologna
- Bologna (0.04)
- Tuscany
- Florence (0.04)
- Pisa Province > Pisa (0.05)
- United Kingdom > England
- South Yorkshire > Sheffield (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany
- Bavaria > Upper Bavaria
- Munich (0.04)
- Berlin (0.04)
- Bavaria > Upper Bavaria
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Austria > Vienna (0.14)
- North America
- Canada (0.04)
- Dominican Republic (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Montana (0.14)
- Washington > King County
- Seattle (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- New Mexico > Bernalillo County
- Albuquerque (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- New York (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Arizona > Maricopa County
- Phoenix (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Oceania > Australia (0.04)
- Africa > Comoros
- Genre:
- Overview (1.00)
- Research Report > New Finding (1.00)
- Industry:
- Education > Educational Setting
- K-12 Education (0.45)
- Government > Regional Government (0.67)
- Health & Medicine (1.00)
- Information Technology (0.92)
- Law (1.00)
- Media > News (0.67)
- Education > Educational Setting
- Technology: