A Formal Perspective on Byte-Pair Encoding
Zouhar, Vilém, Meister, Clara, Gastaldi, Juan Luis, Du, Li, Vieira, Tim, Sachan, Mrinmaya, Cotterell, Ryan
–arXiv.org Artificial Intelligence
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. At face value, BPE appears to be a greedy algorithm, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{{\sigma(\boldsymbol{\mu}^\star)}}(1-e^{-{\sigma(\boldsymbol{\mu}^\star)}})$-approximation of an optimal merge sequence, where ${\sigma(\boldsymbol{\mu}^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
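To make the greedy procedure discussed in the abstract concrete, here is a minimal Python sketch of the iterative merge loop: at each step it counts adjacent symbol pairs and merges the most frequent one. This is an illustrative toy, not the paper's implementation; the function name `greedy_bpe`, the toy input, and the naive full rescan per iteration (which is what gives the $\mathcal{O}(NM)$ behaviour the paper improves to $\mathcal{O}(N \log M)$, e.g. via a priority queue over pair counts) are all assumptions for exposition.

```python
from collections import Counter

def greedy_bpe(sequence, num_merges):
    """Minimal sketch of greedy BPE training (illustrative, not the paper's code).

    Each iteration rescans the whole sequence, which is O(N) per merge and
    O(N M) overall; the paper's faster variant avoids the rescan.
    """
    seq = list(sequence)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the current sequence.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy choice: most frequent pair
        merges.append(best)
        # Replace every non-overlapping occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

# Toy example: apply the two most frequent merges to a short string.
print(greedy_bpe("abababcab", 2))
# -> (['abab', 'ab', 'c', 'ab'], [('a', 'b'), ('ab', 'ab')])
```

The paper's contribution is to show that this greedy choice, viewed through submodularity, is provably close to the optimal merge sequence rather than merely a heuristic.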
Jun-29-2023