Training and Evaluation of a Multilingual Tokenizer for GPT-SW3
Generative language models are pre-trained on large amounts of raw text data. Virtually all language model architectures require the text data to be tokenized: a text string is split into a sequence of tokens and subsequently mapped to a sequence of integers (see Figure 1). The first step is referred to as tokenization, although sometimes both steps are covered by the same term. Note that the ▁ character in the example in Figure 1 represents whitespace (more on this in Sec. 3). Modern subword tokenizers are designed such that frequently used words are not decomposed, while rare words are split into meaningful subword tokens.

Figure 1: Text preprocessing for language models (simplified).
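The two preprocessing steps can be illustrated with a minimal sketch, here assuming the Hugging Face tokenizers library (an assumption for illustration; this is not the GPT-SW3 training setup). A small BPE tokenizer is trained on a toy corpus, after which a string is first split into subword tokens and then mapped to integer ids; the Metaspace pre-tokenizer makes whitespace visible as the ▁ marker mentioned above.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Metaspace

# Toy corpus; a real tokenizer would be trained on large amounts of raw text.
corpus = [
    "generative language models are pre-trained on raw text",
    "the text is split into tokens and mapped to integers",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Metaspace replaces whitespace with the visible marker "▁" before BPE merges.
tokenizer.pre_tokenizer = Metaspace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("language models are trained on text")
print(encoding.tokens)  # step 1: subword tokens, e.g. ['▁language', '▁models', ...]
print(encoding.ids)     # step 2: the corresponding integer ids

With a sufficiently large training corpus and vocabulary, frequent words end up as single tokens while rare words are decomposed into smaller subword pieces, which is the behavior described above.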
arXiv.org Artificial Intelligence
Apr-28-2023