Heaps' Law in GPT-Neo Large Language Model Emulated Corpora

Uyen Lai, Gurjit S. Randhawa, Paul Sheridan

Nov-10-2023, arXiv.org Artificial Intelligence

Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to text generated by large language models remains unexplored. This study addresses that gap by emulating corpora with the GPT-Neo suite of large language models. We emulated corpora of PubMed abstracts using GPT-Neo models of three different parameter sizes. Our emulation strategy used the initial five words of each PubMed abstract as a prompt and instructed the model to expand the content up to the original abstract's length. Our findings indicate that the generated corpora adhere to Heaps' law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps' law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize increasing model size or refining the model architecture to curtail vocabulary repetition.
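
To make the abstract's two technical steps concrete, here is a minimal Python sketch, not the authors' code. It assumes the usual power-law form of Heaps' law, V(n) = K * n^beta, where V is the number of distinct words seen after n tokens; the symbols K and beta, the regex tokenizer, and the function names first_words_prompt, vocabulary_growth, and fit_heaps are all illustrative assumptions. The language model call itself is left out; only the prompt construction and the vocabulary-growth fit are shown.

```python
import math
import re


def first_words_prompt(abstract, n_words=5):
    """Return (prompt, target_length): the first n_words words of the
    abstract and the abstract's total word count (hypothetical helper)."""
    words = abstract.split()
    return " ".join(words[:n_words]), len(words)


def vocabulary_growth(documents):
    """Return a list of (tokens_seen, distinct_words_seen) pairs,
    one pair after each document is added to the corpus."""
    seen = set()
    total = 0
    curve = []
    for doc in documents:
        # Assumed tokenizer: lowercase alphanumeric runs.
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        total += len(tokens)
        seen.update(tokens)
        curve.append((total, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log n by ordinary least squares
    and return (K, beta)."""
    points = [(math.log(n), math.log(v)) for n, v in curve if n > 0 and v > 0]
    if len(points) < 2:
        raise ValueError("need at least two points to fit")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all corpus sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

Under these assumptions, one would call vocabulary_growth on the original abstracts and on each emulated corpus, then compare the fitted K and beta values; a beta below 1 indicates sublinear vocabulary growth.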

Country:
- North America > Canada > Prince Edward Island > Queens County > Charlottetown (0.05)

Genre:
- Research Report > New Finding (0.49)

Industry:
- Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
