OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
arXiv.org Artificial Intelligence
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, a lexeme count on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. The resource also includes 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. It was generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, and the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedic content, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
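The abstract mentions schema-validated LLM outputs but does not give the schema. As a rough illustration of the idea only, the sketch below parses model output as JSON and rejects any record that does not match a small schema. The field names (`lexeme`, `part_of_speech`, `definition`, `examples`) and the helper names are assumptions made for this example, not the actual OpenGloss schema or pipeline.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: these field names are assumed for illustration and
# are not taken from the OpenGloss dataset.
REQUIRED_FIELDS = {
    "lexeme": str,
    "part_of_speech": str,
    "definition": str,
    "examples": list,
}


@dataclass
class SenseRecord:
    lexeme: str
    part_of_speech: str
    definition: str
    examples: list


def parse_sense(raw: str) -> SenseRecord:
    """Parse one model output string; raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name!r} should be {expected.__name__}")
    if not data["definition"].strip():
        raise ValueError("empty definition")
    if not all(isinstance(e, str) for e in data["examples"]):
        raise ValueError("examples must be strings")
    return SenseRecord(**{name: data[name] for name in REQUIRED_FIELDS})


def validate_batch(raw_outputs):
    """Split raw outputs into accepted records and (raw, reason) rejections."""
    accepted, rejected = [], []
    for raw in raw_outputs:
        try:
            accepted.append(parse_sense(raw))
        except ValueError as exc:
            rejected.append((raw, str(exc)))
    return accepted, rejected
```

In a pipeline of this kind, rejected outputs would typically be logged or sent back for regeneration rather than silently dropped; how OpenGloss handles them is not stated in the abstract.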
Nov-25-2025
- Country:
- Asia
- Middle East
- Jordan (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Germany (0.04)
- Middle East > Republic of Türkiye
- Istanbul Province > Istanbul (0.04)
- Netherlands > South Holland
- Dordrecht (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- North America
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Ontario > Toronto (0.04)
- Quebec > Montreal (0.04)
- United States
- California > Alameda County
- Berkeley (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Illinois > Cook County
- Chicago (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- New York > New York County
- New York City (0.04)
- Washington > King County
- Seattle (0.04)
- Genre:
- Research Report (0.64)
- Industry:
- Education > Educational Technology (0.46)
- Law > Intellectual Property & Technology Law (0.68)
- Technology: