Tokenisation is NP-Complete

Philip Whittington, Gregor Bachmann, Tiago Pimentel

arXiv.org Artificial Intelligence 

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation) or selecting a sequence of merge operations (bottom-up tokenisation).
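
To make the two problem variants concrete, below is a minimal sketch of the bottom-up (merge-based, BPE-style) variant and its compression objective. The function names and the greedy left-to-right merge application are illustrative assumptions, not the paper's formal definitions; the decision question is whether some choice of merges brings the dataset to at most $\delta$ symbols.

```python
# Illustrative sketch only: names and details are assumptions,
# not the authors' formalisation of the tokenisation problems.

def apply_merges(dataset, merges):
    """Bottom-up tokenisation: apply a sequence of merge operations
    (pairs of adjacent symbols) to every string, BPE-style."""
    tokenised = []
    for string in dataset:
        symbols = list(string)
        for left, right in merges:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                    merged.append(left + right)  # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        tokenised.append(symbols)
    return tokenised


def total_symbols(tokenised):
    """Compression objective: total symbol count after tokenisation."""
    return sum(len(symbols) for symbols in tokenised)


if __name__ == "__main__":
    data = ["abab", "aab"]
    merges = [("a", "b")]            # one merge operation: a, b -> "ab"
    out = apply_merges(data, merges)
    print(out)                       # [['ab', 'ab'], ['a', 'ab']]
    delta = 5
    print(total_symbols(out) <= delta)  # decision version: within budget?
```

The direct variant differs only in what is searched for: instead of a merge sequence, one picks a vocabulary of substrings and asks whether the dataset can be segmented into at most $\delta$ tokens drawn from it. The NP-completeness results mean that, under standard assumptions, no polynomial-time algorithm finds optimal solutions to either variant.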