APartition Cover Approach for Tokenization

Jun-17-2026, 17:18:27 GMT–Neural Information Processing Systems

Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GREEDTOK. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple (1 1/e)-approximation algorithm GREEDWMC. Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to GREEDWMC.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Jun-17-2026, 17:18:27 GMT

Conferences PDF

Add feedback

Country:
- Asia (0.46)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Education (0.68)
- Information Technology (0.67)
- Health & Medicine (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found