byte pair encoding
Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
Patwary, Firoj Ahmmed, Noman, Abdullah Al
Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are mostly designed for Latin or multilingual corpora and do not perform well on languages with rich morphology such as Bengali. To address this limitation, we present BengaliBPE, a Byte Pair Encoding (BPE) tokenizer specifically developed for the Bengali script. BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. We use a large-scale Bengali news classification dataset to compare BengaliBPE with three baselines: Whitespace, SentencePiece BPE, and HuggingFace BPE. The evaluation considers tokenization granularity, encoding speed, and downstream classification accuracy. While all methods perform reasonably well, BengaliBPE provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost. These findings highlight the importance of language-aware tokenization for morphologically rich scripts and establish BengaliBPE as a strong foundation for future Bengali NLP systems, including large-scale pretraining of contextual language models.
Tokens with Meaning: A Hybrid Tokenization Approach for NLP
Bayram, M. Ali, Fincan, Ali Arda, Gรผmรผล, Ahmet Semih, Karakaล, Sercan, Diri, Banu, Yฤฑldฤฑrฤฑm, Savaล, รelik, Demircan
Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabฤฑ), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29\%) and Pure Token Percentage (85.8\%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
Comparative analysis of subword tokenization approaches for Indian languages
Das, Sudhansu Bala, Choudhury, Samujjal, Mishra, Tapas Kumar, Patra, Bidyut Kr.
Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization, affect ILs. The effectiveness of these subword tokenization techniques is investigated in statistical, neural, and multilingual neural machine translation models. All models are examined using standard evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET. Based on the results, it appears that for the majority of language pairs for the Statistical and Neural MT models, the SentencePiece tokenizer continuously performed better than other tokenizers in terms of BLEU score. However, BPE tokenization outperformed other tokenization techniques in the context of Multilingual Neural Machine Translation model. The results show that, despite using the same tokenizer and dataset for each model, translations from ILs to English surpassed translations from English to ILs.
Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
Elsner, Tim, Usinger, Paula, Nehring-Wirxel, Julius, Kobsik, Gregor, Czech, Victor, He, Yanjiang, Lim, Isaak, Kobbelt, Leif
In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. This is often done with Byte Pair Encoding. In the context of images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, without global content awareness. Our work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this through counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. The multidimensionality only increases the computation time by a factor of 2 for images, making it applicable even to large datasets like ImageNet within minutes on consumer hardware. This is a lossless preprocessing step. Our evaluation shows improved training and inference performance of transformers on visual data achieved by compressing frequent constellations of tokens: The resulting sequences are shorter, with more uniformly distributed information content, e.g. condensing empty regions in an image into single tokens. As our experiments show, these condensed sequences are easier to process. We additionally introduce a strategy to amplify this compression further by clustering the vocabulary.
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Lian, Haoran, Xiong, Yizhe, Niu, Jianwei, Mo, Shasha, Su, Zhenpeng, Lin, Zijia, Liu, Peng, Chen, Hui, Ding, Guiguang
Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus while keeping all tokens that have been merged in the vocabulary, it unavoidably holds tokens that primarily represent subwords of complete words and appear infrequently on their own in the text corpus. We term such tokens as Scaffold Tokens. Due to their infrequent appearance in the text corpus, Scaffold Tokens pose a learning imbalance issue for language models. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for the given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling tasks and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness and superiority.
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge in various natural language and speech processing tasks. Recent research highlights the dependency of BPE subword tokenization's efficacy on the morphological nature of the language, particularly in languages rich in inflectional morphology, where fewer BPE merges suffice for generating highly productive tokens. Motivated by this, our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity, thus enhancing out-of-distribution automatic speech recognition (ASR) performance. Experimental evaluation reveals that an excessively high number of BPE tokens can lead to overfitting, while approximately 500-1000 tokens result in superior OOV performance. Furthermore, we conduct a comparative analysis of BPE with character-based and unigram-based tokenization methods. By introducing BPE tokenization to Bengali ASR, we achieve a substantial reduction in the word error rate (WER) from 66.44% in our character-based baseline system to 63.80% on the LB-ASRTD eval set and from 46.34% to 42.80% on the SHRUTI eval set, both of which include out-of-distribution data.
Byte Pair Encoding for Symbolic Music
Fradet, Nathan, Gutowski, Nicolas, Chhel, Fabien, Briot, Jean-Pierre
When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e. converted into a sequence of discrete tokens. This can be achieved by different approaches, as music can be composed of simultaneous tracks, of simultaneous notes with several attributes. Until now, the proposed tokenizations rely on small vocabularies of tokens describing the note attributes and time events, resulting in fairly long token sequences, and a sub-optimal use of the embedding space of language models. Recent research has put efforts on reducing the overall sequence length by merging embeddings or combining tokens. In this paper, we show that Byte Pair Encoding, a compression technique widely used for natural language, significantly decreases the sequence length while increasing the vocabulary size. By doing so, we leverage the embedding capabilities of such models with more expressive tokens, resulting in both better results and faster inference in generation and classification tasks. The source code is shared on Github, along with a companion website. Finally, BPE is directly implemented in MidiTok, allowing the reader to easily benefit from this method.
How Byte Pair Encoding works part2(Natural Language Processing)
Abstract: Regular expression is important for many natural language processing tasks especially when used to deal with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm to deal with this problem. Different from the methods which generate regular expressions from character level, we first utilize byte pair encoder (BPE) to extract some frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm contains multi objectives and is solved based on evolutionary procedure including crossover and mutation operation. In the fitness function, we take the length of generated regular expression, the maximum matching characters and samples for positive training samples, and the minimum matching characters and samples for negative training samples into consideration.
The Evolution of Tokenization โ Byte Pair Encoding in NLP - KDnuggets
NLP may have been a little late to the AI epiphany but it is doing wonders with organisations like Google, OpenAI releasing state-of-the-art(SOTA) language models like BERT and GPT-2/3 respectively. GitHub Copilot and OpenAI codex are among a few very popular applications that are in the news. As someone who has very limited exposure to NLP, I decided to take up NLP as an area of research and the next few blogs/videos will be me sharing what I learn after dissecting some important components of NLP. Top Deep Learning models like BERT, GPT-2, or GPT-3 all share the same components but with different architectures that distinguish one model from another. In this newsletter(and notebook), we are going to focus on the basics of the first component of an NLP pipeline which is tokenization.
The evolution of Tokenization in NLP -- Byte Pair Encoding in NLP
NLP may have been a little late to the AI epiphany but it is doing wonders with organizations like Google, OpenAI releasing state-of-the-art(SOTA) language models like BERT and GPT-2/3 respectively. GitHub Copilot and OpenAI codex are among a few very popular applications that are in the news. As someone who has very limited exposure to NLP, I decided to take up NLP as an area of research and the next few blogs/videos will be me sharing what I learn after dissecting some important components of NLP. Top Deep Learning models like BERT, GPT-2, or GPT-3 all share the same components but with different architectures that distinguish one model from another. In this newsletter(and notebook), we are going to focus on the basics of the first component of an NLP pipeline which is tokenization.