Model-Aware Tokenizer Transfer
Haltiuk, Mykola, Smywiński-Pohl, Aleksander
arXiv.org Artificial Intelligence
Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.
Recent advances in large language models (LLMs) have shifted attention from training monolingual models (Jiang et al., 2023; Touvron et al., 2023) to covering an increasing number of languages (Grattafiori et al., 2024; Team et al., 2025).
Oct-28-2025
- Country:
- Asia
- Europe
- Austria > Vienna (0.15)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Poland > Lesser Poland Province
- Kraków (0.04)
- Slovenia (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- United States > Washington
- King County > Seattle (0.04)
- South America
- Colombia > Meta Department
- Villavicencio (0.04)
- Paraguay > Asunción
- Asunción (0.04)
- Genre:
- Research Report (1.00)
- Technology: