Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Vemula, Saketh Reddy, Dandapat, Sandipan, Sharma, Dipti Misra, Krishnamurthy, Parameswari
–arXiv.org Artificial Intelligence
The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models -- from pre-training through fine-tuning -- for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.
arXiv.org Artificial Intelligence
Nov-11-2025
- Country:
- Asia
- Europe
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Denmark > Capital Region
- Copenhagen (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Germany > Berlin (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Sweden > Vaestra Goetaland
- Gothenburg (0.04)
- Croatia > Dubrovnik-Neretva County
- North America
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Ontario > Toronto (0.04)
- British Columbia > Metro Vancouver Regional District
- Mexico > Mexico City
- Mexico City (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Florida > Miami-Dade County
- Canada
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Materials > Metals & Mining > Gold (0.34)
- Technology: