AITopics | sub-word tokenization

sub-word tokenization

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

Tan, Yang, Li, Mingchen, Tan, Pan, Zhou, Ziyi, Yu, Huiqun, Fan, Guisheng, Hong, Liang

arXiv.org Artificial Intelligence, Oct-26-2023

Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.

downstream application, protein transfer learning, sub-word tokenization, (1 more...)

arXiv: 2310.17415

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.53)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.60)

Sub-Character Tokenization for Chinese Pretrained Language Models

Si, Chenglei, Zhang, Zhengyan, Chen, Yingfa, Qi, Fanchao, Wang, Xiaozhi, Liu, Zhiyuan, Wang, Yasheng, Liu, Qun, Sun, Maosong

arXiv.org Artificial Intelligence, Feb-14-2023

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
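The pronunciation-based encoding step described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny `TOY_PINYIN` table, the toneless transliterations, the `SEP` boundary marker, and the function name `encode_pronunciation` are all assumptions made for illustration. A real system would use a full pronunciation lexicon and then learn a sub-word vocabulary over the encoded text.

```python
# Toy pronunciation table (hypothetical; covers only a handful of characters).
# Tones are dropped here for simplicity, which is an assumption of this sketch.
TOY_PINYIN = {
    "他": "ta",
    "她": "ta",
    "它": "ta",
    "好": "hao",
    "是": "shi",
    "事": "shi",
}

SEP = "#"  # assumed marker for character boundaries in the encoded text


def encode_pronunciation(text: str) -> str:
    """Replace each known character with its transliteration plus a separator.

    Characters missing from the toy table are passed through unchanged.
    """
    return "".join(TOY_PINYIN.get(ch, ch) + SEP for ch in text)


# Two inputs that differ only by a homophone typo.
enc_a = encode_pronunciation("他好")
enc_b = encode_pronunciation("她好")
same_encoding = enc_a == enc_b
```

Because both inputs map to the same transliteration sequence, any sub-word segmentation applied afterwards sees identical text, which is the intuition behind the robustness to homophone typos that the abstract reports.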

artificial intelligence, machine learning, natural language, (20 more...)

arXiv: 2106.004

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Real or Not? Disaster Tweets classification with RoBERTa

#artificialintelligence, Sep-2-2022, 16:56:15 GMT

This article was published as a part of the Data Science Blogathon. Today we live in a world of active social networking, where every kind of information is shared among users worldwide. This is greatly facilitated by the ubiquity of smartphones and other handheld communication devices. Popular platforms include Facebook, WhatsApp, and LinkedIn; Twitter, however, is a popular microblogging site used worldwide for open information exchange. On Twitter, many types of information are exchanged as short messages, including reports of mishaps or accidents happening around the world.

classification, dataset, tweet, (15 more...)

Industry: Information Technology > Services (0.69)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

Tokenization Algorithms Explained

#artificialintelligence, Aug-3-2021, 03:25:47 GMT

For the uninitiated, let's start by formally introducing the concept of tokenization: it is a method of splitting input text into individual, meaningful tokens that machines can then process. Tokens can be words, characters, or sub-words, depending on the splitting algorithm employed. We'll discuss all three major categories of tokens (words, characters, and sub-words) in this article. We'll also focus on the sub-word tokenization algorithms that most recent state-of-the-art (SOTA) models use: Byte-Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece. By the end of this discussion, you'll have a concrete understanding of each of these approaches and be well equipped to decide which tokenization method best suits your needs.
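To make the three token categories concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the article's code or any library's API: the function names (`word_tokens`, `char_tokens`, `learn_bpe_merges`, `merge_pair`, `bpe_tokens`) and the four-word corpus are assumptions, and the BPE part omits details that production tokenizers handle, such as end-of-word markers and byte-level fallback.

```python
from collections import Counter


def word_tokens(text):
    """Word-level tokens: split on whitespace."""
    return text.split()


def char_tokens(text):
    """Character-level tokens: every non-space character is a token."""
    return [ch for ch in text if not ch.isspace()]


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_tokens(word, merges):
    """Sub-word tokens: apply the learned merges to a word, in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "lowest"]  # hypothetical toy corpus
merges = learn_bpe_merges(corpus, num_merges=2)
subwords = bpe_tokens("lowest", merges)
```

With only two merges learned from this corpus, the frequent stem `low` becomes a single token while the rarer suffix stays split into characters. That is the trade-off sub-word methods aim for: frequent strings get their own tokens, and unseen words can still be represented from smaller pieces.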

sub-word tokenization, tokenization, tokenization algorithm explained, (13 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.89)
