Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

arXiv.org Artificial Intelligence 

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for scaling to long sequences. We discover that, even when a tokenized LM and a token-free (byte-level) LM are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks, where input prompts may terminate mid-token and thus yield out-of-distribution tokenizations, our method mitigates the resulting performance degradation and achieves an approximately 18% improvement on FIM coding benchmarks, consistently outperforming the standard token-healing fix. For model ensembles in which each model employs a distinct vocabulary, our approach enables seamless integration, improving performance (by up to 3.7%) over the individual models across standard baselines in reasoning, knowledge, and coding.

Transformers form the backbone of all widely used state-of-the-art language models (LMs), such as GPTs (Brown et al., 2020), Llama (Touvron et al., 2023), and Mistral (Jiang et al., 2023a).
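
The tokenization bias described in the abstract can be seen in a small, self-contained sketch. This is our own illustration under toy assumptions, not the paper's Byte-Token Representation Lemma or its sampling algorithm: a uniform byte-level distribution over length-3 strings from the alphabet {a, b}, a greedy longest-match tokenizer with vocabulary {"a", "b", "ab"}, and a token-level model obtained by pushing the string distribution through the tokenizer. Conditioning the token model on the prefix's own tokenization drives the probability of the next byte "b" to zero, even though the true byte-level probability is 0.5.

```python
"""Toy illustration of tokenization bias (not the paper's algorithm).

Assumptions, ours for illustration only: a two-character alphabet, a uniform
byte-level distribution over length-3 strings, and a greedy longest-match
tokenizer with vocabulary {"a", "b", "ab"}.
"""
from itertools import product

VOCAB = ["ab", "a", "b"]  # sorted longest-first for greedy matching


def tokenize(s: str) -> tuple:
    """Greedy longest-match tokenization of a string."""
    tokens, i = [], 0
    while i < len(s):
        for tok in VOCAB:
            if s.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
    return tuple(tokens)


# Byte-level ground truth: uniform distribution over all length-3 strings.
strings = ["".join(chars) for chars in product("ab", repeat=3)]
p_string = {s: 1.0 / len(strings) for s in strings}

# Induced token-level model: push the string distribution through the tokenizer.
p_tokens = {}
for s, p in p_string.items():
    t = tokenize(s)
    p_tokens[t] = p_tokens.get(t, 0.0) + p

prefix = "a"

# (1) True byte-level probability that the next byte is 'b' given the prefix.
mass_prefix = sum(p for s, p in p_string.items() if s.startswith(prefix))
mass_prefix_b = sum(p for s, p in p_string.items() if s.startswith(prefix + "b"))
true_next_b = mass_prefix_b / mass_prefix

# (2) Naive token-level estimate: condition on the prefix's own tokenization
#     ("a" -> ("a",)) and ask how much mass continues with a token starting in 'b'.
prefix_toks = tokenize(prefix)
mass_tok_prefix = sum(
    p for t, p in p_tokens.items() if t[: len(prefix_toks)] == prefix_toks
)
mass_tok_next_b = sum(
    p
    for t, p in p_tokens.items()
    if t[: len(prefix_toks)] == prefix_toks
    and len(t) > len(prefix_toks)
    and t[len(prefix_toks)].startswith("b")
)
naive_next_b = mass_tok_next_b / mass_tok_prefix

print(f"true  P(next byte = 'b' | 'a') = {true_next_b:.3f}")   # 0.500
print(f"naive P(next byte = 'b' | 'a') = {naive_next_b:.3f}")  # 0.000: the bias
```

The sketch only exhibits the mismatch: because "a" followed by "b" always merges into the token "ab", the token sequence ("a", "b", ...) never occurs, so naive conditioning on the prefix's tokenization assigns zero probability to the next byte "b". The paper's contribution is the exact correction: the Byte-Token Representation Lemma maps the learned token distribution to its equivalent byte-level distribution, and the next-byte sampling algorithm removes this bias without further training.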
