Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
arXiv.org Artificial Intelligence
Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of various tokenization strategies - word-level, character-level, character n-gram, and Byte Pair Encoding (BPE) - on the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed every alternative strategy tested. These findings suggest that in agglutinative, low-resource contexts, preserving word boundaries via word-level tokenization may yield better embedding performance than more complex statistical segmentation methods. This has practical implications for developing NLP pipelines for under-resourced languages where annotated data and computing power are limited.
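The contrast between the strategies compared can be illustrated with a minimal sketch. The example word and helper names below are illustrative only, not taken from the paper; BPE is omitted because it requires a learned merge table rather than a fixed rule.

```python
def word_tokens(text):
    """Word-level tokenization: split on whitespace, keeping word boundaries intact."""
    return text.split()

def char_tokens(word):
    """Character-level tokenization: each character becomes its own token."""
    return list(word)

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as used in subword embedding schemes."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# "evlerimizden" is a single Turkish word meaning "from our houses":
# ev (house) + ler (plural) + imiz (our) + den (from).
word = "evlerimizden"
print(word_tokens("evlerimizden geliyorum"))  # two word-level tokens
print(char_tokens(word))                      # twelve character tokens
print(char_ngrams(word))                      # overlapping trigrams spanning morphemes
```

Note that the trigrams cut across morpheme boundaries (e.g. spanning the end of "ev" and the start of "ler"), which is one plausible reason statistical subword segmentation can underperform whole-word tokens when training data is too scarce for good subword statistics.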
Sep-30-2025