An Analysis of Tokenization: Transformers under Markov Data

May-30-2025, 00:32:35 GMT–Neural Information Processing Systems

While a large body of research has attempted to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary first step in designing state-of-the-art language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data-generating processes.

large language model, machine learning, natural language

Country:
- Europe
  - Belgium (0.14)
  - Germany (0.14)
- North America > Canada (0.14)

Genre:
- Research Report > Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.67)
  - Natural Language
    - Chatbot (0.87)
    - Large Language Model (1.00)
  - Representation & Reasoning (1.00)