MGen: Millions of Naturally Occurring Generics in Context

Gustavo Cilleruelo, Emily Allaway, Barry Haddow, Alexandra Birch

Nov-25-2025–arXiv.org Artificial Intelligence

MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Each sentence comes with a long context document, drawn from websites and academic papers, and the dataset covers 11 different quantifiers. Our analysis of the generic sentences in the dataset shows that generics can be long (averaging over 16 words) and that speakers often use them to express generalisations about people. MGen is the largest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen

artificial intelligence, machine learning, natural language


Country:
- Europe (1.00)
- Asia > Middle East
  - UAE (0.28)

Genre:
- Research Report (0.64)

Industry:
- Health & Medicine
  - Consumer Health (1.00)
  - Therapeutic Area
    - Infections and Infectious Diseases (0.68)
    - Immunology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Representation & Reasoning (0.68)
  - Natural Language > Text Processing (0.46)
