Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
arXiv.org Artificial Intelligence
Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of various tokenization strategies - word-level, character-level, character n-gram, and Byte Pair Encoding (BPE) - on the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed every alternative strategy tested. These findings suggest that in agglutinative, low-resource contexts, preserving word boundaries via word-level tokenization may yield better embedding performance than more complex statistical segmentation methods. This has practical implications for developing NLP pipelines for under-resourced languages where annotated data and computing power are limited.
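The contrast between the strategies compared can be illustrated with a minimal sketch. The example word and helper names below are illustrative only, not taken from the paper; BPE is omitted because it requires a learned merge table rather than a fixed rule.

```python
def word_tokens(text):
    """Word-level tokenization: split on whitespace, keeping word boundaries intact."""
    return text.split()

def char_tokens(word):
    """Character-level tokenization: each character becomes its own token."""
    return list(word)

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as used in subword embedding schemes."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# "evlerimizden" is a single Turkish word meaning "from our houses":
# ev (house) + ler (plural) + imiz (our) + den (from).
word = "evlerimizden"
print(word_tokens("evlerimizden geliyorum"))  # two word-level tokens
print(char_tokens(word))                      # twelve character tokens
print(char_ngrams(word))                      # overlapping trigrams spanning morphemes
```

Note that the trigrams cut across morpheme boundaries (e.g. spanning the end of "ev" and the start of "ler"), which is one plausible reason statistical subword segmentation can underperform whole-word tokens when training data is too scarce for good subword statistics.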
Sep-30-2025