Token Alignment via Character Matching for Subword Completion

Ben Athiwaratkun, Shiqi Wang, Mingyue Shang, Yuchen Tian, Zijian Wang, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Rob Kwiatkowski, Ramesh Nallapati, Bing Xiang

arXiv.org Artificial Intelligence 

Generative models, widely utilized in various applications, often struggle with prompts that end in partial tokens. This struggle stems from tokenization: partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate this tokenization artifact in text completion while maintaining performance in regular, non-subword cases. The method, termed token alignment, backtracks to the last complete tokens and constrains the model's generation to align with the prompt. The approach yields marked improvement across many partial-token scenarios, including nuanced cases such as space prefixes and partial indentation, at the cost of only a minor increase in generation time. The technique and analysis detailed in this paper contribute to the continuing advancement of generative models in handling partial inputs, with relevance for applications such as code completion and text autocompletion.

Generative models have shown remarkable efficacy across a range of applications. However, they have been observed to falter when dealing with partially provided inputs, or subwords, during text completion. For instance, when a prompt ends in a subword, a model may struggle to predict the remainder of the word and often produces incorrect or nonsensical output. This issue arises from an artifact of tokenization: a partial token can be out of distribution during inference.
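As described, token alignment backtracks a few tokens from the end of the prompt and then constrains decoding so that the generated characters reproduce the truncated suffix before generation continues freely. The sketch below illustrates the idea with a hypothetical token_aligned_generate helper on a HuggingFace-style causal LM; the fixed backtrack count, greedy decoding, and naive full-vocabulary prefix check are simplifying assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the token-alignment idea, assuming a HuggingFace-style
# causal LM and tokenizer. The function name and fixed backtrack count are
# illustrative choices; a real implementation would handle byte-level BPE
# markers (e.g. leading-space tokens) and efficiency more carefully.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_aligned_generate(model, tokenizer, prompt: str,
                           backtrack: int = 3, max_new_tokens: int = 32) -> str:
    ids = tokenizer.encode(prompt)
    # Backtrack to the last complete tokens: drop the trailing tokens,
    # whose characters the constrained generation must reproduce.
    kept, dropped = ids[:-backtrack], ids[-backtrack:]
    suffix = tokenizer.decode(dropped)
    # Precompute each vocabulary token's surface string once.
    vocab = [tokenizer.decode([i]) for i in range(len(tokenizer))]
    input_ids = torch.tensor([kept])
    budget = max_new_tokens
    # Constrained phase: only tokens whose characters are consistent with
    # the remaining prompt suffix are allowed (greedy for simplicity).
    while suffix and budget > 0:
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]
        mask = torch.full_like(logits, float("-inf"))
        for tok_id, text in enumerate(vocab):
            # Allowed if the token is a prefix of the remaining suffix,
            # or if it completes the suffix and continues past it.
            if text and (suffix.startswith(text) or text.startswith(suffix)):
                mask[tok_id] = 0.0
        next_id = int(torch.argmax(logits + mask))
        suffix = suffix[len(vocab[next_id]):]  # consume matched characters
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
        budget -= 1
    # Unconstrained phase: ordinary generation for the remaining budget.
    out = model.generate(input_ids, max_new_tokens=max(budget, 1),
                         do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0])


if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # Prompt ends mid-word ("fo" is a partial token in many tokenizers).
    print(token_aligned_generate(model, tokenizer, "The quick brown fo"))
```

The key design point the sketch captures is that alignment is enforced at the character level rather than the token level, so the model is free to re-tokenize the backtracked span however its training distribution prefers, as long as the emitted characters match the prompt.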
