How Much is Enough? The Diminishing Returns of Tokenization Training Data

Reddy, Varshini, Schmidt, Craig W., Pinter, Yuval, Tanner, Chris

Feb-27-2025–arXiv.org Artificial Intelligence

Tokenization, a crucial initial step in natural language processing, is often assumed to benefit from larger training datasets. This paper investigates the impact of tokenizer training data sizes ranging from 1GB to 900GB. Our findings reveal diminishing returns as the data size increases, highlighting a practical limit on how much further scaling the training data can improve tokenization quality. We analyze this phenomenon and attribute the saturation effect to the constraints imposed by the pre-tokenization stage of tokenization. These results offer valuable insights for optimizing the tokenization process and highlight potential avenues for future research in tokenization algorithms.

tokenizer, training data, vocabulary size, (14 more...)

arXiv.org Artificial Intelligence

Feb-27-2025

arXiv.org PDF

Add feedback

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - United States
    - Virginia (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Massachusetts > Middlesex County
      - Cambridge (0.05)
    - Florida > Miami-Dade County
      - Miami (0.04)
  - Mexico > Mexico City
    - Mexico City (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Germany > Berlin (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
- Asia
  - Singapore (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)
  - Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.14)
    - Israel > Southern District
      - Beer-Sheva (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Health & Medicine (0.68)
- Law (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found