Parameterized Synthetic Text Generation with SimpleStories
Finke, Lennart; Sreedhara, Chandan; Dooms, Thomas; Allen, Mat; Zhang, Emerald; Rodriguez, Juan Diego; Nabeshima, Noa; Marshall, Thomas; Braun, Dan
arXiv.org Artificial Intelligence
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we push the frontier of the smallest language model, measured in parameters, that outputs grammatical natural language.
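The prompt parameterization described in the abstract can be pictured with a minimal sketch. The axes (`THEMES`, `STYLES`, `FEATURES`), their values, the template wording, and the helper names `build_prompt` and `build_prompts` are all hypothetical, chosen only for illustration; they are not taken from the SimpleStories pipeline.

```python
import random

# Hypothetical parameter axes. The names and values are illustrative
# assumptions, not the options used by the SimpleStories authors.
THEMES = ["friendship", "courage", "curiosity"]
STYLES = ["fairy tale", "adventure", "bedtime story"]
FEATURES = ["a plot twist", "dialogue", "a moral lesson"]


def build_prompt(rng: random.Random, language: str = "English") -> str:
    """Sample one value per axis and fill a fixed prompt template."""
    theme = rng.choice(THEMES)
    style = rng.choice(STYLES)
    feature = rng.choice(FEATURES)
    return (
        f"Write a short {style} in simple {language} "
        f"about {theme}. Include {feature}."
    )


def build_prompts(n: int, seed: int = 0) -> list[str]:
    """Generate n prompts reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [build_prompt(rng) for _ in range(n)]


prompts = build_prompts(5)
for p in prompts:
    print(p)
```

Sampling each axis independently is one simple way to spread generated stories across combinations of characteristics; a fixed seed keeps the prompt set reproducible.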
Jun-3-2025
- Country:
- Asia > Middle East > Jordan (0.04)
- Europe > Belgium > Flanders > Antwerp Province > Antwerp (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Genre:
- Research Report (1.00)
- Technology: