An Analysis of Tokenization: Transformers under Markov Data
Neural Information Processing Systems
While a large body of research has attempted to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it remains a necessary initial step for building state-of-the-art language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data-generating processes.
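To make the setting concrete, here is a minimal sketch (not the paper's code) of the kind of simple data-generating process the abstract refers to: a first-order binary Markov "switching" source, with hypothetical transition parameters p and q chosen purely for illustration.

```python
import numpy as np

def sample_markov_sequence(p: float, q: float, length: int, rng=None) -> np.ndarray:
    """Sample a binary sequence from a first-order Markov chain.

    Hypothetical transition parameters (for illustration only):
      P(next = 1 | current = 0) = p,  P(next = 0 | current = 1) = q.
    """
    rng = rng or np.random.default_rng(0)
    seq = np.empty(length, dtype=np.int64)
    # Draw the first symbol from the stationary distribution: pi(1) = p / (p + q).
    seq[0] = rng.random() < p / (p + q)
    for t in range(1, length):
        # Flip the previous symbol with the state-dependent switch probability.
        flip = rng.random() < (p if seq[t - 1] == 0 else q)
        seq[t] = 1 - seq[t - 1] if flip else seq[t - 1]
    return seq

# Example: small p and q yield long runs of identical symbols, the kind of
# structure a tokenizer (e.g., BPE) would compress into run-like tokens.
data = sample_markov_sequence(p=0.1, q=0.1, length=32)
print(data)
```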