tinystory
Readability ≠ Learnability: Rethinking the Role of Simplicity in Training Small Language Models
Lee, Ivan, Berg-Kirkpatrick, Taylor
Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.
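The abstract's key predictor, n-gram diversity, can be made concrete with a minimal sketch. The distinct-n formulation below (unique n-grams over total n-grams) is one common way to operationalize statistical simplicity and is an assumption here; the paper's exact measure may differ.

```python
from collections import Counter

def ngram_diversity(tokens, n=2):
    """Distinct-n: fraction of n-grams in a token sequence that are unique.

    One common way to quantify the 'statistical simplicity' the abstract
    refers to; the paper's exact formulation may differ.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Lower distinct-n means more repetitive (statistically simpler) text.
child_directed = "the cat sat on the mat and the cat sat down".split()
adult_level = "epistemic humility tempers overconfident extrapolation from sparse data".split()
print(ngram_diversity(child_directed))  # repetitive -> lower diversity (0.8)
print(ngram_diversity(adult_level))     # varied -> higher diversity (1.0)
```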
Measuring LLM Novelty as the Frontier of Original and High-Quality Output
Padmakumar, Vishakh, Chen, Yueh-Han, Pan, Jane, Chen, Valerie, He, He
As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to model training data, but original outputs may be of low quality. In contrast, non-expert judges more reliably score quality but may favor memorized outputs, limiting the reliability of human preference as a metric. We introduce a new novelty metric for LLM generations that balances originality and quality -- the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. Using this framework, we identify trends that affect the novelty of generations from three families of open-data models (OLMo, OLMo-2, and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that model-generated text from some base LLMs is less novel than human-written text from the internet. However, increasing model scale and post-training reliably improves novelty due to improvements in output quality. We also find that improving the base model at the same scale (e.g., OLMo 7B to OLMo-2 7B) leads to higher novelty due to higher originality. Finally, we observe that inference-time methods, such as prompting and providing novel in-context examples, have a much smaller effect on novelty, often increasing originality at the expense of quality. This highlights the need for further research into more effective elicitation strategies as we use models for creative applications.
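The metric itself is fully specified by the abstract: the harmonic mean of originality (the fraction of generated n-grams unseen during training) and a task-specific quality score. A minimal sketch follows, assuming both components are scaled to [0, 1]; the function and variable names are illustrative, not the authors' code.

```python
def novelty_score(generated_ngrams, training_ngrams, quality):
    """Harmonic mean of originality and quality, as the abstract describes.

    originality = fraction of generated n-grams unseen in the training data;
    quality is a task-specific score, assumed here to lie in [0, 1].
    """
    if not generated_ngrams:
        return 0.0
    unseen = sum(1 for g in generated_ngrams if g not in training_ngrams)
    originality = unseen / len(generated_ngrams)
    if originality + quality == 0:
        return 0.0
    return 2 * originality * quality / (originality + quality)

# A memorized-but-fluent output scores low on originality; a garbled-but-new
# output scores low on quality; the harmonic mean penalizes both failure modes.
training = {("once", "upon"), ("upon", "a"), ("a", "time")}
generated = [("once", "upon"), ("upon", "a"), ("a", "quark"), ("quark", "sang")]
print(novelty_score(generated, training, quality=0.9))  # ~0.643
```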
Parameterized Synthetic Text Generation with SimpleStories
Finke, Lennart, Sreedhara, Chandan, Dooms, Thomas, Allen, Mat, Zhang, Emerald, Rodriguez, Juan Diego, Nabeshima, Noa, Marshall, Thomas, Braun, Dan
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to models trained on the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier for the fewest-parameter language model that outputs grammatical natural language.
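A minimal sketch of what "parameterizing prompts at multiple levels of abstraction" could look like: crossing independent parameter axes and instantiating a template. The axes, values, and template below are illustrative assumptions, not the actual SimpleStories generation pipeline.

```python
import itertools
import random

# Illustrative parameter axes; the real SimpleStories axes and values
# are assumptions here, not taken from the dataset's release.
TOPICS = ["a lost kitten", "a rainy day", "a new friend"]
MORALS = ["kindness", "patience", "honesty"]
STYLES = ["dialogue-heavy", "descriptive"]

TEMPLATE = (
    "Write a short story in simple language about {topic}. "
    "The story should illustrate {moral} and be {style}."
)

def sample_prompts(k=5, seed=0):
    """Cross the parameter axes, then sample without replacement: varying
    each axis independently is what induces syntactic and semantic diversity."""
    rng = random.Random(seed)
    grid = list(itertools.product(TOPICS, MORALS, STYLES))
    picks = rng.sample(grid, k)
    return [TEMPLATE.format(topic=t, moral=m, style=s) for t, m, s in picks]

for prompt in sample_prompts(3):
    print(prompt)
```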
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Theodoropoulos, Nikitas, Filandrianos, Giorgos, Lyberatos, Vassilis, Lymperaiou, Maria, Stamou, Giorgos
We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset of: a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall has a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low-resource settings and underscores its potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.
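A minimal sketch of the story-completion step described above, using the Hugging Face transformers generation API with an off-the-shelf GPT-Neo checkpoint; the checkpoint choice and decoding settings are assumptions, since the paper trains its own GPT-Neo models on TinyStories subsets.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Off-the-shelf checkpoint for illustration; the paper's models are
# trained on TinyStories subsets and are not assumed here.
name = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

story_prefix = "Once upon a time, a little fox found a shiny key."
inputs = tokenizer(story_prefix, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,      # sampling encourages original completions
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```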