Training and Evaluation of a Multilingual Tokenizer for GPT-SW3
Generative language models are pre-trained on large amounts of raw text data. Virtually all language model architectures require the text data to be tokenized: a text string is split into a sequence of tokens and subsequently mapped to a sequence of integers (see Figure 1). The first step is referred to as tokenization, although sometimes both steps are covered by the same term. Note that the ▁ character in the example in Figure 1 represents whitespace (more on this in Sec. 3). Modern subword tokenizers are designed such that frequently used words are not decomposed, while rare words are split into meaningful subword tokens.

Figure 1: Text preprocessing for language models (simplified).
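The two preprocessing steps can be illustrated with a minimal sketch, here assuming the Hugging Face tokenizers library (an assumption for illustration; this is not the GPT-SW3 training setup). A small BPE tokenizer is trained on a toy corpus, after which a string is first split into subword tokens and then mapped to integer ids; the Metaspace pre-tokenizer makes whitespace visible as the ▁ marker mentioned above.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Metaspace

# Toy corpus; a real tokenizer would be trained on large amounts of raw text.
corpus = [
    "generative language models are pre-trained on raw text",
    "the text is split into tokens and mapped to integers",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Metaspace replaces whitespace with the visible marker "▁" before BPE merges.
tokenizer.pre_tokenizer = Metaspace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("language models are trained on text")
print(encoding.tokens)  # step 1: subword tokens, e.g. ['▁language', '▁models', ...]
print(encoding.ids)     # step 2: the corresponding integer ids

With a sufficiently large training corpus and vocabulary, frequent words end up as single tokens while rare words are decomposed into smaller subword pieces, which is the behavior described above.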
arXiv.org Artificial Intelligence
Apr-28-2023