fertility
- North America > United States > New York (0.05)
- North America > United States > California (0.05)
- North America > United States > South Carolina (0.04)
- (4 more...)
- Media > News (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.97)
- Health & Medicine > Therapeutic Area > Oncology (0.95)
- (2 more...)
Forecasting India's Demographic Transition Under Fertility Policy Scenarios Using a Hybrid LSTM-PINN Model
Khanra, Subarna, Kukreja, Vijay Kumar, Bala, Indu
Demographic forecasting remains a fundamental challenge for policy planning in rapidly evolving nations such as India, where fertility transitions, policy interventions, and age-structured dynamics interact in complex ways. In this study, we present a hybrid modelling framework that integrates policy-aware fertility functions into a Physics-Informed Neural Network (PINN) enhanced with Long Short-Term Memory (LSTM) networks to capture physical constraints and temporal dependencies in population dynamics. The model is applied to India's age-structured population from 2024 to 2054 under three fertility-policy scenarios: continuation of the current fertility decline, stricter population control, and relaxed fertility promotion. The governing transport-reaction partial differential equation is formulated with India-specific demographic indicators, including age-specific fertility and mortality rates. The PINN component embeds the core population equation and policy-driven fertility changes, while the LSTM layers improve long-term forecasting across decades. Results show that fertility policies substantially shape the future age distribution, dependency ratios, and workforce size. Stricter controls intensify ageing and reduce labour force participation, whereas relaxed policies support workforce growth but increase population pressure. Our findings suggest that the hybrid LSTM-PINN is an effective approach for demographic forecasting, combining accuracy with interpretability. Beyond methodological novelty, this work provides actionable insights for India's demographic policy debates, highlighting the need for balanced fertility interventions to ensure sustainable socio-economic development.
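The abstract cites a governing transport-reaction PDE without stating it. The standard age-structured population model that such frameworks typically embed is the McKendrick-von Foerster equation with a fertility boundary condition; the form below is an assumption for orientation, not necessarily the authors' exact policy-aware formulation:

```latex
% Assumed standard form; the paper's policy-aware variant may differ.
\frac{\partial n(a,t)}{\partial t} + \frac{\partial n(a,t)}{\partial a}
  = -\mu(a,t)\, n(a,t),
\qquad
n(0,t) = \int_{0}^{A} f(a,t)\, n(a,t)\, \mathrm{d}a
```

Here n(a,t) is the population density at age a and time t, μ(a,t) the age-specific mortality rate, and f(a,t) the age-specific fertility rate that the policy scenarios would modulate; a PINN typically penalises the residual of such an equation alongside the data-fitting loss.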
- Asia > India (1.00)
- Asia > China (0.04)
- Oceania > Australia > South Australia > Adelaide (0.04)
- (3 more...)
- Health & Medicine > Public Health (1.00)
- Government (1.00)
- Banking & Finance (1.00)
- Education (0.94)
The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improving text generation and cross-lingual transfer for low-resource, morphologically complex languages.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- Asia > Singapore (0.04)
- (21 more...)
Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
Nayeem, Mir Tafseer, Alqahtani, Sawsan, Laskar, Md Tahmid Rahman, Mohiuddin, Tasnim, Bari, M Saiful
Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.
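Both metrics are simple to compute from raw text and a tokenizer. A minimal sketch, assuming whitespace-delimited words and a HuggingFace-style tokenizer (the paper's exact word-segmentation rules are not given in the abstract):

```python
# Sketch of fertility (mean tokens per word) and STRR (share of words kept as a
# single token). Tokenizing words in isolation ignores in-context effects such
# as BPE leading-space markers; this is a simplification.
from transformers import AutoTokenizer

def fertility_and_strr(texts, tokenizer):
    total_words = total_tokens = single_token_words = 0
    for text in texts:
        for word in text.split():
            n = len(tokenizer.tokenize(word))
            total_words += 1
            total_tokens += n
            single_token_words += (n == 1)
    return total_tokens / total_words, single_token_words / total_words

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer under study
print(fertility_and_strr(["the cat sat on the mat"], tok))
```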
- North America > Canada > Alberta (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.05)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (4 more...)
The Token Tax: Systematic Bias in Multilingual Tokenization
Lundin, Jessica M., Zhang, Ada, Karim, Nihal, Louzan, Hamza, Wei, Victor, Adelani, David, Carroll, Cody
Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute requirements and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens per word) reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Finally, translating token inflation into economic terms, we show that a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
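The quadratic cost claim follows from a standard scaling argument: dense-transformer training FLOPs are commonly estimated as 6 x parameters x tokens, and under compute-optimal scaling the parameter count grows roughly in proportion to the token budget. A back-of-envelope sketch, assuming that cost model (the paper's exact accounting is not given):

```python
# Hedged "token tax" arithmetic: if a high-fertility tokenizer doubles the token
# count for the same text, and parameters scale with tokens, cost quadruples.
def train_flops(params, tokens):
    return 6 * params * tokens  # common dense-transformer estimate

base = train_flops(params=1e9, tokens=2e10)
taxed = train_flops(params=2e9, tokens=4e10)  # tokens doubled, params scaled along
print(taxed / base)  # -> 4.0
```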
- North America > United States > California > San Francisco County > San Francisco (0.05)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (2 more...)
- Research Report > New Finding (0.70)
- Research Report > Experimental Study (0.47)
No Translation Needed: Forecasting Quality from Fertility and Metadata
Lundin, Jessica M., Zhang, Ada, Adelani, David, Carroll, Cody
We show that translation quality can be predicted with surprising accuracy without ever running the translation system itself. Using only a handful of features (token fertility ratios, token counts, and basic linguistic metadata such as language family, script, and region), we can forecast ChrF scores for GPT-4o translations across the 203 languages of the FLORES-200 benchmark. Gradient boosting models achieve favorable performance (R² = 0.66 for XX→English and R² = 0.72 for English→XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.
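A sketch of the forecasting setup with scikit-learn. The feature names and toy rows below are placeholders for illustration, not the paper's actual schema or data:

```python
# Predict ChrF from fertility and linguistic metadata via gradient boosting,
# without running the MT system. Real use: one row per FLORES-200 language pair.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.DataFrame({  # hypothetical placeholder rows
    "fertility_ratio": [1.1, 2.3, 1.7, 3.0],
    "token_count":     [980, 2100, 1500, 2600],
    "family":          ["Indo-European", "Sino-Tibetan", "Afro-Asiatic", "Niger-Congo"],
    "script":          ["Latin", "Han", "Arabic", "Latin"],
    "chrf":            [62.0, 41.5, 48.2, 33.7],
})
X = pd.get_dummies(df[["fertility_ratio", "token_count", "family", "script"]])
model = GradientBoostingRegressor().fit(X, df["chrf"])
print(model.predict(X[:1]))  # forecast quality for a new language pair
```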
- North America > United States > California > San Francisco County > San Francisco (0.05)
- Asia > Indonesia > Bali (0.05)
- Africa > Niger (0.05)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Thakur, Aamod, Nagpal, Ajay, Savarkar, Atharva, Pundalik, Kundeshwar, Dosi, Siddhesh, Sawarkar, Piyush, Thakur, Viraj, Saluja, Rohit, Desarkar, Maunendra Sankar, Ramakrishnan, Ganesh
While model architecture and training objectives are well studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Applying our pre-tokenization findings significantly improves model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% relative to the conventional data randomization approach. Our tokenizer achieves more than a 40% improvement in average token-to-word ratio over state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed, highlighting tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
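The abstract does not spell out the composition algorithm itself. For orientation, the sketch below shows the common temperature-scaled sampling baseline that such balancing methods are typically compared against; the corpus sizes are hypothetical and this is not the authors' algorithm:

```python
# Temperature-scaled sampling: raise raw corpus shares to a power alpha < 1 so
# low-resource languages get a larger slice of the tokenizer-training mix.
def temperature_weights(corpus_sizes, alpha=0.3):
    scaled = {lang: size ** alpha for lang, size in corpus_sizes.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

sizes = {"hi": 5_000_000, "ta": 800_000, "mr": 600_000}  # hypothetical token counts
print(temperature_weights(sizes))  # low-resource shares rise as alpha -> 0
```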
- Asia > Middle East > Jordan (0.04)
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (5 more...)
Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Cho, Gyeongje, So, Yeonkyoun, Park, Chanwoo, Lee, Sangmin, Jung, Sungmok, Lee, Jaejin
This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
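Branching entropy, the statistic behind Thunder-Tok's selection algorithm, measures the uncertainty of the next character after a candidate string; positions where that uncertainty spikes tend to mark natural unit boundaries. A simplified sketch, since the abstract does not give the exact estimator:

```python
# Next-character branching entropy over a small corpus. Thunder-Tok's actual
# estimator and corpus handling may differ; this shows the core quantity only.
import math
from collections import defaultdict

def branching_entropy(corpus, prefix):
    """Entropy of the next-character distribution after `prefix`."""
    counts = defaultdict(int)
    for text in corpus:
        start = text.find(prefix)
        while start != -1:
            nxt = start + len(prefix)
            if nxt < len(text):
                counts[text[nxt]] += 1
            start = text.find(prefix, start + 1)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values()) if total else 0.0

corpus = ["the cat sat", "the cap fit", "the car sped"]
print(branching_entropy(corpus, "ca"))  # many continuations -> high entropy -> boundary
```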
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation
Moroni, Luca, Puccetti, Giovanni, Cabot, Pere-Lluis Huguet, Bejgu, Andrei Stefan, Barba, Edoardo, Miaschi, Alessio, Dell'Orletta, Felice, Esuli, Andrea, Navigli, Roberto
The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
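The abstract describes SAVA only as "neural mapping for vocabulary substitution". The sketch below illustrates the general embedding-space-mapping idea, fitting a least-squares linear map on tokens shared between a donor model and the LLM, then transporting new-vocabulary embeddings; this is one plausible instantiation, not the authors' exact method:

```python
# Map embeddings of tokens unseen by the LLM from a donor model's space into the
# LLM's space via a linear map fitted on shared-token pairs (random placeholders).
import numpy as np

def fit_linear_map(donor_shared, llm_shared):
    """Least-squares W such that donor_shared @ W ~= llm_shared."""
    W, *_ = np.linalg.lstsq(donor_shared, llm_shared, rcond=None)
    return W

rng = np.random.default_rng(0)
donor_shared = rng.normal(size=(1000, 256))   # donor embeddings of shared tokens
llm_shared = rng.normal(size=(1000, 4096))    # LLM embeddings of the same tokens
W = fit_linear_map(donor_shared, llm_shared)

new_tok_donor = rng.normal(size=(1, 256))     # e.g. an Italian token new to the LLM
new_tok_llm = new_tok_donor @ W               # its initialization in the LLM's space
```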
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Comoros > Grande Comore > Moroni (0.04)
- (15 more...)
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Kiulian, Artur, Polishko, Anton, Khandoga, Mykola, Kostiuk, Yevhen, Gabrielli, Guillermo, Gagała, Łukasz, Zaraket, Fadi, Obaida, Qusai Abu, Garud, Hrishikesh, Mak, Wendy Wing Yee, Chaplynskyi, Dmytro, Amor, Selma Belhadj, Peradze, Grigol
In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training, and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text.
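The abstract lists "initialization of new embeddings" as a step without giving details. A widely used heuristic for this step, assumed here rather than confirmed as the authors' choice, initialises each added token as the mean of the old-tokenizer subtoken embeddings it spans:

```python
# Mean-of-subtokens initialisation for an expanded vocabulary (one common
# heuristic; not necessarily the paper's exact scheme).
import torch

def init_new_embeddings(old_embed, old_tokenizer, new_tokens):
    """Return one embedding row per new token, averaged over its old subtokens."""
    rows = []
    for tok in new_tokens:
        ids = old_tokenizer.encode(tok, add_special_tokens=False)
        rows.append(old_embed.weight[ids].mean(dim=0))
    return torch.stack(rows)

# Usage: append the returned rows to the resized embedding matrix before the
# continued-pretraining stage on the target language.
```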
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.05)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (12 more...)