APartition Cover Approach for Tokenization
–Neural Information Processing Systems
Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GREEDTOK. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple (1 1/e)-approximation algorithm GREEDWMC. Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to GREEDWMC.
Neural Information Processing Systems
Jun-17-2026, 17:18:27 GMT
- Country:
- Asia (0.46)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Research Report
- Industry:
- Education (0.68)
- Information Technology (0.67)
- Health & Medicine (0.67)
- Technology: