Parameterized Synthetic Text Generation with SimpleStories

Lennart Finke, Chandan Sreedhara, Thomas Dooms, Mat Allen, Emerald Zhang, Juan Diego Rodriguez, Noa Nabeshima, Thomas Marshall, Dan Braun

arXiv.org Artificial Intelligence 

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we control story characteristics at scale and induce syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we push the frontier of the fewest-parameter language model that outputs grammatical natural language.
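To illustrate what parameterizing prompts can look like in practice, the following is a minimal sketch and not the actual SimpleStories pipeline. The parameter names (theme, topic, style, feature), their value pools, the prompt wording, and the functions `build_prompt` and `build_prompts` are all assumptions made for illustration. Each prompt samples one value per parameter, so that varying the parameters varies the requested story.

```python
import random

# Hypothetical parameter pools; the values used for SimpleStories are not stated here.
THEMES = ["friendship", "courage", "curiosity"]
TOPICS = ["a lost kite", "a talking river", "a snowy village"]
STYLES = ["playful", "calm", "mysterious"]
FEATURES = ["a dialogue", "a twist ending", "a moral"]


def build_prompt(rng: random.Random, language: str = "English") -> dict:
    """Sample one value per parameter and render a story-generation prompt."""
    params = {
        "theme": rng.choice(THEMES),
        "topic": rng.choice(TOPICS),
        "style": rng.choice(STYLES),
        "feature": rng.choice(FEATURES),
    }
    prompt = (
        f"Write a short story in simple {language} about {params['topic']}. "
        f"The theme is {params['theme']}, the tone is {params['style']}, "
        f"and the story should include {params['feature']}."
    )
    # Keep the sampled parameters next to the prompt so each story can be labeled later.
    return {"params": params, "prompt": prompt}


def build_prompts(n: int, seed: int = 0) -> list[dict]:
    """Build n prompts from a seeded generator so the batch is reproducible."""
    rng = random.Random(seed)
    return [build_prompt(rng) for _ in range(n)]


if __name__ == "__main__":
    for item in build_prompts(3, seed=42):
        print(item["prompt"])
```

Storing the sampled parameters alongside each prompt is a design choice of this sketch: it would let a generated story be filtered or analyzed by the characteristics that were requested for it.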