Parameterized Synthetic Text Generation with SimpleStories

Finke, Lennart; Sreedhara, Chandan; Dooms, Thomas; Allen, Mat; Zhang, Emerald; Rodriguez, Juan Diego; Nabeshima, Noa; Marshall, Thomas; Braun, Dan

Jun-3-2025–arXiv.org Artificial Intelligence

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we control story characteristics at scale and induce syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier of the fewest-parameter language model that outputs grammatical natural language.

large language model, machine learning, natural language

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.48)
  - Natural Language > Large Language Model (0.30)
