Tokenization Algorithms Explained
For the uninitiated, let's start by formally introducing the concept of tokenization: it is simply a method of splitting input text into separate, meaningful tokens that machines can understand and process further. Tokens can be words, characters, or sub-words, depending on the splitting algorithm employed. We'll discuss all three major categories of tokens in this article: words, characters, and sub-words. We'll also focus on the sub-word tokenization algorithms that most recent state-of-the-art (SOTA) models use: Byte-Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece. By the end of this discussion, you'll have a concrete understanding of each of these approaches and be well equipped to decide which tokenization method best suits your needs.
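To make the three token categories concrete, here is a minimal Python sketch. The function names and the tiny vocabulary are hypothetical, chosen only for illustration. The sub-word function is a toy greedy longest-match split over a hand-written vocabulary; it is not an implementation of BPE, WordPiece, Unigram, or SentencePiece, all of which learn their vocabularies from data.

```python
import re


def word_tokens(text):
    # Word-level: runs of word characters, plus each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    # Character-level: every character (including spaces) is a token.
    return list(text)


def subword_tokens(word, vocab):
    # Sub-word level (toy): repeatedly take the longest prefix found in `vocab`.
    # If no prefix matches, fall back to a single character.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            end = start + 1  # no vocabulary piece matched here
        tokens.append(word[start:end])
        start = end
    return tokens


toy_vocab = {"token", "ization"}  # hypothetical vocabulary, assumed for the example

print(word_tokens("Tokenizers split text."))
print(char_tokens("text"))
print(subword_tokens("tokenization", toy_vocab))
```

The trade-off the sketch hints at: word tokens are readable but need a very large vocabulary, character tokens need a tiny vocabulary but produce long sequences, and sub-word tokens sit in between.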
Aug-3-2021, 03:25:47 GMT