 sub-word tokenization


PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

arXiv.org Artificial Intelligence

Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language, protein amino acid sequences have a smaller data volume and a limited combinatorial space, so choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there is still no comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 degrade the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.


Sub-Character Tokenization for Chinese Pretrained Language Models

arXiv.org Artificial Intelligence

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
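The encoding step described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the pronunciation table `TOY_PINYIN`, the separator `SEP`, and the function `encode_pronunciation` are hypothetical names introduced here, and the table covers only a handful of characters. The sketch shows only the first stage (character to transliteration); the second stage, building a sub-word vocabulary over the encoded text, is omitted.

```python
# Toy, hypothetical pronunciation table (toneless pinyin).
# A real system would need a table covering the full character set.
TOY_PINYIN = {
    "他": "ta",
    "她": "ta",
    "它": "ta",
    "们": "men",
    "好": "hao",
}

# Assumed boundary marker so that character boundaries survive the encoding.
SEP = "#"


def encode_pronunciation(text: str) -> str:
    """Replace each known character with its transliteration plus a boundary marker.

    Characters missing from the toy table are passed through unchanged.
    """
    return "".join(TOY_PINYIN.get(ch, ch) + SEP for ch in text)


# Homophones map to the same encoded sequence, so any tokenizer built on top
# of this encoding produces identical output for both inputs.
same = encode_pronunciation("他们") == encode_pronunciation("她们")
```

Because 他 and 她 share a pronunciation, both inputs above encode to the same string, which is the property the abstract credits for robustness to homophone typos.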


Real or Not? Disaster Tweets classification with RoBERTa

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Today we live in a world of active social networking where every kind of information is shared among users worldwide. This is greatly facilitated by the ubiquity of smartphones and other handheld communication devices. Some popular sites are Facebook, WhatsApp, LinkedIn, etc.; however, Twitter is a viral microblogging site used worldwide for open information exchange. On Twitter, various types of information are exchanged in the form of short messages, including information about mishaps or accidents happening worldwide.


Tokenization Algorithms Explained

#artificialintelligence

For the uninitiated, let's start by formally introducing the concept of tokenization: it is simply a method of splitting input text into individual, meaningful tokens that can be further understood and processed by machines. Tokens can be words, characters, or even sub-words, depending on which splitting algorithm is employed. We'll discuss all three major categories of tokens (words, characters, and sub-words) in this article. We'll also focus on the sub-word tokenization algorithms that most recent SOTA models use: Byte-Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece. By the end of this discussion, you'll have a concrete understanding of each of these approaches and be well equipped to decide which tokenization method best suits your needs.
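To make the three token categories concrete, here is a minimal sketch in plain Python. It is a simplified teaching version of BPE, not the code of any particular library: the function names (`word_tokens`, `char_tokens`, `learn_bpe`, `apply_bpe`) and the toy corpus are assumptions made for illustration, and details such as end-of-word markers and byte-level handling are left out.

```python
from collections import Counter


def word_tokens(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()


def char_tokens(text):
    """Character-level tokenization: one token per character."""
    return list(text)


def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a list of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges; returns the ordered list of pairs."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Sub-word tokenization: replay the learned merges, in order, on a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = word_tokens("low low lower lowest")
merges = learn_bpe(corpus, num_merges=2)
pieces = apply_bpe("lowest", merges)  # ['low', 'e', 's', 't']
```

On this toy corpus the two learned merges build up the frequent stem `low`, so an unseen-looking word like `lowest` is split into a known sub-word plus leftover characters rather than being treated as a single unknown token.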