bpe
Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxรณchtil Mixtec ASR
This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxรณchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
Length-MAX Tokenizer for Language Models
We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14--18\% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0\% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5\%, 17.2\%, and 18.5\% fewer steps, respectively, to reach a fixed validation loss, and 13.7\%, 12.7\%, and 13.7\% lower inference latency, together with a 16\% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7\% and enhancing HellaSwag accuracy by 4.3\%. Moreover, the Length-MAX tokenizer achieves 99.62\% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12\% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18\% at inference.
Subword Tokenization Strategies for Kurdish Word Embeddings
Salehi, Ali, Jacobs, Cassandra L.
We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6\% of test cases compared to 68.7\% for morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offers different tokenization methods for low-resourced language processing.
MorphTok: Morphologically Grounded Tokenization for Indian Languages
Brahma, Maharaj, Karthika, N J, Singh, Atul, Adiga, Devaraj, Bhate, Smruti, Ramakrishnan, Ganesh, Saluja, Rohit, Desarkar, Maunendra Sankar
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams, often leading to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step before applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves machine translation and language modeling performance. Additionally, to handle the dependent vowels common in syllable-based writing systems used by Indic languages, we propose Constrained BPE (CBPE), an extension to the standard BPE algorithm incorporating script-specific constraints. In particular, CBPE handles dependent vowels to form a cohesive unit with other characters instead of occurring as a single unit. Our results show that CBPE achieves a 1.68\% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation and language modeling, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, \textit{EvalTok}, enabling more human-grounded assessment.
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
Teklehaymanot, Hailay Kidu, Fazlija, Dren, Nejdl, Wolfgang
Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC
Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
Slyman, Eric, Tanjim, Mehrab, Kafle, Kushal, Lee, Stefan
Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt en-sembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. T o address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. W e show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
Hu, Yifan, Liang, Frank, Zhao, Dachuan, Geuter, Jonathan, Reddy, Varshini, Schmidt, Craig W., Tanner, Chris
Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.
Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Cho, Gyeongje, So, Yeonkyoun, Park, Chanwoo, Lee, Sangmin, Jung, Sungmok, Lee, Jaejin
This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Land, Sander, Arnett, Catherine
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.