Tokenization Algorithms Explained

Aug-3-2021, 03:25:47 GMT–#artificialintelligence

For the uninitiated, let's start by formally introducing the concept of tokenization: it is simply a method of splitting input text into individual, meaningful tokens that machines can further process. Tokens can be words, characters, or sub-words, depending on the splitting algorithm employed. In this article we'll discuss all three major categories of tokens: words, characters, and sub-words. We'll also focus on the sub-word tokenization algorithms that most recent state-of-the-art (SOTA) models use: Byte-Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece. By the end of this discussion, you'll have a concrete understanding of each of these approaches and be well equipped to decide which tokenization method best suits your needs.
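To make the three token granularities concrete, here is a minimal Python sketch. All function names (`word_tokens`, `char_tokens`, `learn_bpe_merges`, `merge_pair`, `bpe_tokens`) are hypothetical and chosen for illustration. The sub-word part is a simplified take on the BPE idea of repeatedly merging the most frequent adjacent symbol pair; it omits end-of-word markers and byte-level handling, so treat it as a sketch under those assumptions rather than the procedure any particular model or library uses.

```python
import re
from collections import Counter


def word_tokens(text):
    # Word-level: runs of word characters, plus each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    # Character-level: every character (including spaces) is a token.
    return list(text)


def merge_pair(symbols, pair):
    # Replace each adjacent occurrence of `pair` with one merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    # Simplified BPE training: start from characters and greedily merge
    # the most frequent adjacent pair, up to `num_merges` times.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def bpe_tokens(word, merges):
    # Sub-word level: apply the learned merges, in order, to a new word.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    print(word_tokens("Let's tokenize!"))
    print(char_tokens("abc"))
    merges = learn_bpe_merges(["low", "low", "low", "lower"], num_merges=2)
    print(merges)
    print(bpe_tokens("lowest", merges))  # ['low', 'e', 's', 't']
```

With this toy corpus, the unseen word "lowest" is split into the learned piece "low" plus single characters, which shows how sub-word tokenization sits between the word and character extremes.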
