Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Patil, Nirvan; Inamdar, Malhar Abhay; Gosai, Agnivo; Pathak, Guruprasad; Joshi, Anish; Sagavekar, Aryan; Joshirao, Anish; Dandekar, Raj; Dandekar, Rajat; Panat, Sreedath
arXiv.org Artificial Intelligence
The 2023 TinyStories study developed an English dataset that allows Small Language Models (SLMs) with 1-10 million parameters to produce coherent outputs matching those of LLMs. Our research expands this framework by creating translated as well as synthetically generated datasets in Indian languages. Using this new dataset, we demonstrate that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, and additionally offer a complementary framework for "inference-based evaluation" of tokenization strategies and linguistic complexity. Our analysis reveals that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provide insights into the superior performance of Hindi models over Marathi and Bengali. The study uncovers distinct cross-linguistic patterns: Bengali emphasizes creativity, Hindi excels in context understanding and grammar with model scaling, and Marathi requires larger models to capture its unique linguistic features. Optimal parameter allocation varies, with Hindi benefiting more from wider architectures and Bengali favoring a balanced approach. We also show that quality synthetic datasets outperform translated content for training SLMs by 15-30%. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.
Apr-23-2025
- Country:
- Asia > India
- Maharashtra > Pune (0.04)
- Europe
- North America
- Canada > Ontario
- Toronto (0.04)
- United States
- Florida > Miami-Dade County
- Miami (0.04)
- Michigan > Washtenaw County
- Ann Arbor (0.04)
- Washington > King County
- Seattle (0.04)
- Genre:
- Research Report > New Finding (0.93)
- Technology: