Batching BPE Tokenization Merges
arXiv.org Artificial Intelligence
The Byte Pair Encoding (BPE) algorithm can be safely batched to merge hundreds of token pairs at a time when building up a tokenizer's vocabulary. This technique, combined with reducing the memory footprint of the text used in vocabulary training, makes it feasible to train a high-quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure-Python implementation of these concepts, with the goal of making experimentation with new tokenization strategies more accessible, especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated by training several token vocabularies to explore the batch-merging process and to experiment with preprocessing a stop-word list and ignoring the least common text chunks in a dataset. The resulting encoded lengths of texts serve as a basic evaluation metric.
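The batch-merging idea described above can be sketched in a few lines of Python. This is a minimal illustration rather than BatchBPE's actual implementation: the function names are invented here, and the safety heuristic used (only batching the most frequent pairs that share no tokens, so every merge in the batch can be applied unambiguously in a single pass) is one plausible way to make batching safe, assumed for the sake of the example.

```python
from collections import Counter

def get_pair_counts(ids):
    # Count occurrences of each adjacent token pair.
    return Counter(zip(ids, ids[1:]))

def merge_batch(ids, batch):
    # Apply a whole batch of merges (pair -> new token id) in one
    # left-to-right pass over the token sequence.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) in batch:
            out.append(batch[(ids[i], ids[i + 1])])
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_batched(text, vocab_size, batch_size=100):
    # Start from raw UTF-8 bytes; new token ids begin at 256.
    ids = list(text.encode("utf-8"))
    merges, next_id = {}, 256
    while next_id < vocab_size:
        counts = get_pair_counts(ids)
        if not counts:
            break
        # Select up to batch_size frequent pairs whose member tokens
        # don't overlap, so the batch is unambiguous in one pass.
        batch, used = {}, set()
        for pair, _ in counts.most_common():
            if len(batch) == batch_size or next_id >= vocab_size:
                break
            if pair[0] in used or pair[1] in used:
                continue
            batch[pair] = next_id
            used.update(pair)
            next_id += 1
        if not batch:
            break
        merges.update(batch)
        ids = merge_batch(ids, batch)
    return merges

merges = train_batched("aaabdaaabac" * 10, vocab_size=260, batch_size=4)
```

Compared with classic BPE, which rescans pair counts after every single merge, each batch here amortizes one counting pass over many new vocabulary entries, which is the source of the training speedup the paper describes.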
Aug-5-2024