UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Imperial, Joseph Marvin; Barayan, Abdullah; Stodden, Regina; Wilkens, Rodrigo; Sanchez, Ricardo Munoz; Gao, Lingyun; Torgbi, Melissa; Knight, Dawn; Forey, Gail; Jablonkai, Reka R.; Kochmar, Ekaterina; Reynolds, Robert; Ribeiro, Eugénio; Saggion, Horacio; Volodina, Elena; Vajjala, Sowmya; François, Thomas; Alva-Manchego, Fernando; Madabushi, Harish Tayyar
arXiv.org Artificial Intelligence
We introduce UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources and standardised into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pre-trained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
Sep-17-2025
- Country:
- Asia
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- Philippines (0.04)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Estonia > Tartu County
- Tartu (0.04)
- Portugal
- United Kingdom
- England > Cambridgeshire
- Cambridge (0.04)
- Scotland > City of Edinburgh
- Edinburgh (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Sweden
- Uppsala County > Uppsala (0.04)
- Vaestra Goetaland > Gothenburg (0.04)
- Belgium (0.04)
- Italy
- Slovenia (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- Netherlands (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany (0.04)
- Finland > Northern Ostrobothnia
- Oulu (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- Bulgaria (0.04)
- Spain > Galicia
- A Coruña Province > Santiago de Compostela (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Mexico
- Jalisco > Guadalajara (0.04)
- Mexico City > Mexico City (0.04)
- United States
- California > San Diego County
- San Diego (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Maine (0.04)
- Massachusetts > Middlesex County
- Somerville (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- South America > Chile
- Genre:
- Research Report > New Finding (0.87)