AITopics | sentencepiece

sentencepiece

bf64451da212313c5ef1a00f49232c47-Paper-Conference.pdf

Neural Information Processing Systems, Feb-16-2026, 21:42:49 GMT

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada (0.04)
Europe > Spain (0.04)
(3 more...)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
(3 more...)

df4f371f1f89ec8ba5014b3310578048-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-12-2026, 09:32:04 GMT

computational linguistic, hyperparameter, sentencepiece, (7 more...)

Neural Information Processing Systems

Country: Europe > Belgium > Brussels-Capital Region > Brussels (0.06)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

RETVec: Resilient and Efficient Text Vectorizer

Neural Information Processing Systems, Oct-9-2025, 06:22:51 GMT

This paper describes RETVec, an efficient, resilient, and multilingual text vectorizer designed for neural-based text processing. RETVec combines a novel character encoding with an optional small embedding model to embed words into a 256-dimensional vector space. The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks. In this paper, we evaluate and compare RETVec to state-of-the-art vectorizers and word embeddings on popular model architectures and datasets. These comparisons demonstrate that RETVec leads to competitive, multilingual models that are significantly more resilient to typos and adversarial text attacks.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada (0.04)
Europe > Spain (0.04)
(3 more...)

Industry:

Information Technology > Security & Privacy (0.68)
Government > Military (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)

Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

Wangchuk, Tandin, Gonsalves, Tad

arXiv.org Artificial Intelligence, Sep-22-2025

Large Language Models (LLMs) are gaining popularity and improving rapidly. Tokenizers are crucial components of natural language processing, especially for LLMs. Tokenizers break down input text into tokens that models can easily process while ensuring the text is accurately represented, capturing its meaning and structure. Effective tokenizers enhance the capabilities of LLMs by improving a model's understanding of context and semantics, ultimately leading to better performance in various downstream tasks, such as translation, classification, sentiment analysis, and text generation. Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. Dzongkha, Bhutan's national language spoken by around seven hundred thousand people, is a low-resource language, and its linguistic complexity poses unique NLP challenges. Despite some progress, significant research in Dzongkha NLP is lacking, particularly in tokenization. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. Specifically, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram) were evaluated for their suitability for Dzongkha. Performance was assessed using metrics like Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization, paving the way for further NLP advancements. This underscores the need for tailored approaches for low-resource languages and ongoing research. In this study, we presented three tokenization algorithms for Dzongkha, paving the way for building Dzongkha Large Language Models.
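Two of the metrics named in this abstract can be illustrated with a minimal sketch. It assumes the usual definitions (subword fertility as the average number of tokens per word, and proportion of continued words as the share of words split into two or more tokens) and assumes the input is already segmented into words. The `toy_tokenize` function is a hypothetical stand-in, not any of the tokenizers evaluated in the paper.

```python
from typing import Callable, List


def toy_tokenize(word: str, width: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: split a word into fixed-width chunks."""
    return [word[i:i + width] for i in range(0, len(word), width)]


def subword_fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens produced per word."""
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)


def proportion_continued(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of words that are split into two or more subword tokens."""
    split_words = sum(1 for w in words if len(tokenize(w)) > 1)
    return split_words / len(words)


# Assumed toy input: a pre-segmented word list.
words = ["the", "token", "tokenizer", "is"]
fertility = subword_fertility(words, toy_tokenize)
continued = proportion_continued(words, toy_tokenize)
print(fertility, continued)
```

Lower values on both metrics mean the tokenizer keeps more words intact. Any real tokenizer can be passed in place of `toy_tokenize`, provided it maps a word to a list of tokens.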

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.15255

Country:

Europe (0.46)
Asia > Bhutan (0.25)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

The State of Large Language Models for African Languages: Progress and Challenges

Hussen, Kedir Yassin, Sewunetie, Walelign Tewabe, Ayele, Abinew Ali, Imam, Sukairaj Hafiz, Muhammad, Shamsuddeen Hassan, Yimam, Seid Muhie

arXiv.org Artificial Intelligence, Jun-27-2025

The rapid progress of Large Language Models (LLMs) has transformed the field of Natural Language Processing (NLP). However, these advancements have primarily concentrated on high-resource languages, leaving many low-resource languages, particularly African languages, largely overlooked. Africa has over 2,000 languages [Ethnologue, 2025], the majority of which face significant challenges such as a lack of data, limited computational resources, insufficient NLP tools, and the absence of standardized benchmarks. This study presents a three-stage review to evaluate LLMs' current status, challenges, and prospects for African languages. The first stage investigates both commercial and open-source LLMs with more than 7 billion parameters regarding their support for African languages [Wang et al., 2024]. The second stage examines foundational multilingual models that have significantly influenced NLP research and development.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.0228

Country:

Europe (1.00)
Africa > Middle East (0.93)
Asia > Middle East > UAE (0.46)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Tokenization Matters: Improving Zero-Shot NER for Indic Languages

Pattnayak, Priyaranjan, Patel, Hitesh Laxmichand, Agarwal, Amit

arXiv.org Artificial Intelligence, Apr-25-2025

Tokenization is a critical component of Natural Language Processing (NLP), especially for low-resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low-resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and character-level tokenization strategies using IndicBERT for NER tasks in low-resource Indic languages like Assamese, Bengali, Marathi, and Odia, as well as extremely low-resource Indic languages like Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties (tokenization efficiency, out-of-vocabulary (OOV) rates, and morphological preservation) and extrinsic downstream performance, including fine-tuning and zero-shot cross-lingual transfer. Our experiments show that SentencePiece is a consistently better-performing approach than BPE for NER in low-resource Indic languages, particularly in zero-shot cross-lingual settings, as it better preserves entity consistency. While BPE provides the most compact tokenization form, it is not capable of generalization because it misclassifies or even fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece constitutes a better linguistic structural preservation model, benefiting extremely low-resource and morphologically rich Indic languages, such as Santali and Manipuri, for superior entity recognition, as well as high generalization across scripts, such as Sindhi, written in Arabic script. The results point to SentencePiece as the more effective tokenization strategy for NER within multilingual and low-resource Indic NLP applications.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.16977

Country: Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

Rusli, Andre, Shishido, Makoto

arXiv.org Artificial Intelligence, Dec-23-2024

This study investigates the performance of three popular tokenization tools (MeCab, Sudachi, and SentencePiece) when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
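As a rough illustration of the TF-IDF step mentioned in this abstract, the sketch below computes weights for documents that have already been tokenized (for example by one of the three tools compared). It assumes the plain textbook form, with term frequency as a count divided by document length and inverse document frequency as log(N / df). Library implementations often use smoothed variants, and the paper's exact configuration is not stated here, so the numbers are illustrative only. The sample documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {token: weight} mapping per pre-tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents each token appears in.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return vectors


# Assumed toy corpus; real input would be tokenizer output for each review.
docs = [["good", "movie"], ["bad", "movie"]]
vectors = tf_idf(docs)
print(vectors[0])
```

A token that occurs in every document ("movie" here) gets weight zero under this unsmoothed form, which is why the choice of tokenizer changes which features a downstream classifier can use.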

machine learning, natural language, text classification, (19 more...)

arXiv.org Artificial Intelligence

2412.17361

Country: Asia > Japan > Honshū (0.14)

Genre: Research Report > New Finding (1.00)

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Suyunu, Burak, Taylan, Enes, Özgür, Arzucan

arXiv.org Artificial Intelligence, Nov-26-2024

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.
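One common way to check adherence to Zipf's law is to fit a line to log frequency against log rank. A slope near -1 is the classic Zipfian pattern. The sketch below is a generic least-squares version of that check, not the procedure used in the paper, and the token list is a made-up example whose frequencies follow 12 / rank exactly.

```python
import math
from collections import Counter
from typing import List


def zipf_slope(tokens: List[str]) -> float:
    """Least-squares slope of log(frequency) versus log(rank).

    Requires at least two distinct tokens.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical token stream with frequencies 12, 6, 4, 3 (that is, 12 / rank).
tokens = ["A"] * 12 + ["B"] * 6 + ["C"] * 4 + ["D"] * 3
slope = zipf_slope(tokens)
print(round(slope, 6))
```

In practice the tokens would be the subword units produced by a trained tokenizer over a protein corpus, and a slope far from -1 would indicate a deviation from the Zipfian pattern.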

protein sequence, sentencepiece, vocabulary size, (10 more...)

arXiv.org Artificial Intelligence

2411.17669

Country:

North America > United States (0.14)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Multilingual Large Language Models and Curse of Multilinguality

Gurgurov, Daniil, Bäumel, Tanja, Anikina, Tatiana

arXiv.org Artificial Intelligence, Jun-15-2024

Multilingual Large Language Models (LLMs) have gained wide popularity among Natural Language Processing (NLP) researchers and practitioners. These models, trained on huge datasets, show proficiency across various languages and demonstrate effectiveness in numerous downstream tasks. This paper navigates the landscape of multilingual LLMs, providing an introductory overview of their technical aspects. It explains underlying architectures, objective functions, pre-training data sources, and tokenization methods. This work explores the unique features of different model types: encoder-only (mBERT, XLM-R), decoder-only (XGLM, PaLM, BLOOM, GPT-3), and encoder-decoder models (mT5, mBART). Additionally, it addresses one of the significant limitations of multilingual LLMs, the curse of multilinguality, and discusses current attempts to overcome it.

architecture, computational linguistic, multilingual llm, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.48550/arXiv.2406.10602

2406.10602

Country: