Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text
Velasco, Dan John, Roque, Matthew Theodore
arXiv.org Artificial Intelligence
Most languages lack sufficient data for large-scale monolingual pretraining, creating a "data wall." Multilingual pretraining helps but is limited by language imbalance and the "curse of multilinguality." An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil (two typologically distant, lower-resource languages) and pretraining GPT-2 models (124M to 774M parameters) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.
Sep-23-2025
- Country:
- Africa > Zambia (0.04)
- Asia
- China (0.04)
- Indonesia > Bali (0.04)
- Middle East
- Jordan (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Philippines (0.04)
- Russia (0.04)
- Singapore (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Finland > Uusimaa
- Helsinki (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Russia (0.04)
- Spain (0.04)
- North America > United States
- Florida > Miami-Dade County
- Miami (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Oceania > New Zealand (0.04)
- South America > Venezuela
- Capital District > Caracas (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Government > Regional Government
- Health & Medicine
- Epidemiology (0.68)
- Therapeutic Area
- Immunology (0.68)
- Infections and Infectious Diseases (0.68)
- Law (0.68)
- Transportation (0.68)
- Technology: