Token Alignment via Character Matching for Subword Completion

Ben Athiwaratkun, Shiqi Wang, Mingyue Shang, Yuchen Tian, Zijian Wang, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Rob Kwiatkowski, Ramesh Nallapati, Bing Xiang

arXiv.org Artificial Intelligence 

Generative models, widely utilized in various applications, often struggle with prompts that end in partial tokens. This struggle stems from tokenization: partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate this tokenization artifact in text completion while maintaining performance in regular, non-subword cases. The method, termed token alignment, backtracks to the last complete tokens and constrains the model's generation to align with the prompt. The approach yields marked improvement across many partial-token scenarios, including nuanced cases such as space prefixes and partial indentation, at the cost of only a minor increase in generation time. The technique and analysis detailed in this paper contribute to the continuing advancement of generative models in handling partial inputs, with relevance for applications such as code completion and text autocompletion.

Generative models have shown remarkable efficacy across a range of applications. However, they have been observed to falter when dealing with partially provided inputs, or subwords, during text completion. For instance, when a prompt ends in a subword, a model may struggle to predict the remainder of the word and often produces incorrect or nonsensical output. This issue arises from an artifact of tokenization: a partial token can be out of distribution during inference.
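As described, token alignment backtracks a few tokens from the end of the prompt and then constrains decoding so that the generated characters reproduce the truncated suffix before generation continues freely. The sketch below illustrates the idea with a hypothetical token_aligned_generate helper on a HuggingFace-style causal LM; the fixed backtrack count, greedy decoding, and naive full-vocabulary prefix check are simplifying assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the token-alignment idea, assuming a HuggingFace-style
# causal LM and tokenizer. The function name and fixed backtrack count are
# illustrative choices; a real implementation would handle byte-level BPE
# markers (e.g. leading-space tokens) and efficiency more carefully.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_aligned_generate(model, tokenizer, prompt: str,
                           backtrack: int = 3, max_new_tokens: int = 32) -> str:
    ids = tokenizer.encode(prompt)
    # Backtrack to the last complete tokens: drop the trailing tokens,
    # whose characters the constrained generation must reproduce.
    kept, dropped = ids[:-backtrack], ids[-backtrack:]
    suffix = tokenizer.decode(dropped)
    # Precompute each vocabulary token's surface string once.
    vocab = [tokenizer.decode([i]) for i in range(len(tokenizer))]
    input_ids = torch.tensor([kept])
    budget = max_new_tokens
    # Constrained phase: only tokens whose characters are consistent with
    # the remaining prompt suffix are allowed (greedy for simplicity).
    while suffix and budget > 0:
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]
        mask = torch.full_like(logits, float("-inf"))
        for tok_id, text in enumerate(vocab):
            # Allowed if the token is a prefix of the remaining suffix,
            # or if it completes the suffix and continues past it.
            if text and (suffix.startswith(text) or text.startswith(suffix)):
                mask[tok_id] = 0.0
        next_id = int(torch.argmax(logits + mask))
        suffix = suffix[len(vocab[next_id]):]  # consume matched characters
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
        budget -= 1
    # Unconstrained phase: ordinary generation for the remaining budget.
    out = model.generate(input_ids, max_new_tokens=max(budget, 1),
                         do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0])


if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # Prompt ends mid-word ("fo" is a partial token in many tokenizers).
    print(token_aligned_generate(model, tokenizer, "The quick brown fo"))
```

The key design point the sketch captures is that alignment is enforced at the character level rather than the token level, so the model is free to re-tokenize the backtracked span however its training distribution prefers, as long as the emitted characters match the prompt.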
