Evaluating Morphological Alignment of Tokenizers in 70 Languages
Arnett, Catherine, Hudspeth, Marisa, O'Connor, Brendan
arXiv.org Artificial Intelligence
While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment explains little of the variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
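To make the idea of aligning token boundaries with morphological boundaries concrete, here is a minimal sketch of one way such an alignment could be scored for a single word. It is not the MorphScore definition, which the abstract does not spell out; the function names `boundary_offsets` and `boundary_alignment` and the precision/recall framing are assumptions chosen for illustration only.

```python
def boundary_offsets(pieces):
    """Return the internal character offsets at which one piece ends
    and the next begins (the end of the word is not counted)."""
    offsets = set()
    position = 0
    for piece in pieces[:-1]:
        position += len(piece)
        offsets.add(position)
    return offsets


def boundary_alignment(tokens, morphemes):
    """Hypothetical per-word score: precision and recall of token
    boundaries against gold morpheme boundaries.

    Assumes both segmentations spell the same surface string."""
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("segmentations must cover the same string")
    token_bounds = boundary_offsets(tokens)
    gold_bounds = boundary_offsets(morphemes)
    shared = token_bounds & gold_bounds
    # By convention here, an empty set of boundaries scores 1.0.
    precision = len(shared) / len(token_bounds) if token_bounds else 1.0
    recall = len(shared) / len(gold_bounds) if gold_bounds else 1.0
    return precision, recall


# A tokenizer that splits "walking" as walk + ing matches the morphemes;
# one that splits it as wal + king places its only boundary in the wrong spot.
aligned = boundary_alignment(["walk", "ing"], ["walk", "ing"])
misaligned = boundary_alignment(["wal", "king"], ["walk", "ing"])
```

Averaging such per-word scores over a language's evaluation items would give one possible language-level alignment number, which could then be correlated with task performance in the way the abstract describes.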
Jul-10-2025
- Country:
- Asia
- China (0.04)
- Indonesia > Bali (0.04)
- Japan > Kyūshū & Okinawa
- Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
- Middle East
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.14)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Czechia
- Prague (0.04)
- South Moravian Region > Brno (0.04)
- Sweden
- Vaestra Goetaland > Gothenburg (0.04)
- Östergötland County > Linköping (0.04)
- Hungary > Csongrád-Csanád County
- Szeged (0.04)
- Estonia
- Harju County > Tallinn (0.04)
- Tartu County > Tartu (0.04)
- North Macedonia > Skopje Statistical Region
- Skopje Municipality > Skopje (0.04)
- France
- Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
- Marseille (0.04)
- Île-de-France > Paris
- Paris (0.04)
- Slovenia (0.04)
- Finland > Southwest Finland
- Turku (0.04)
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Netherlands (0.04)
- Spain
- Germany
- Berlin (0.04)
- Saarland > Saarbrücken (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- District of Columbia > Washington (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Massachusetts > Hampshire County
- Amherst (0.04)
- Washington > King County
- Seattle (0.04)
- South America
- Genre:
- Research Report > New Finding (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language
- Chatbot (0.35)
- Grammars & Parsing (0.46)
- Large Language Model (0.47)