lexeme
OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (15 more...)
- Law > Intellectual Property & Technology Law (0.68)
- Education > Educational Technology (0.46)
The aftermath of compounds: Investigating Compounds and their Semantic Representations
This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearmans correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
Unsupervised Learning by Program Synthesis
Kevin Ellis, Armando Solar-Lezama, Josh Tenenbaum
We introduce an unsupervised learning algorithm that combines probabilistic modeling with solver-based techniques for program synthesis. We apply our techniques to both a visual learning domain and a language learning problem, showing that our algorithm can learn many visual concepts from only a few examples and that it can recover some English inflectional morphology. Taken together, these results give both a new approach to unsupervised learning of symbolic compositional structures, and a technique for applying program synthesis tools to noisy data.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.82)
Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language
Sennrich, Kilian, Ahmadi, Sina
Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
- Europe > Switzerland > Zürich > Zürich (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Using Language Models to Disambiguate Lexical Choices in Translation
Barua, Josh, Subramanian, Sanjay, Yin, Kayo, Suhr, Alane
In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving from 67 to 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
- Asia > India (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)
The natural stability of autonomous morphology
Round, Erich, Esher, Louise, Beniamine, Sacha
Autonomous morphology, such as inflection class systems and paradigmatic distribution patterns, is widespread and diachronically resilient in natural language. Why this should be so has remained unclear given that autonomous morphology imposes learning costs, offers no clear benefit relative to its absence and could easily be removed by the analogical forces which are constantly reshaping it. Here we propose an explanation for the resilience of autonomous morphology, in terms of a diachronic dynamic of attraction and repulsion between morphomic categories, which emerges spontaneously from a simple paradigm cell filling process. Employing computational evolutionary models, our key innovation is to bring to light the role of `dissociative evidence', i.e., evidence for inflectional distinctiveness which a rational reasoner will have access to during analogical inference. Dissociative evidence creates a repulsion dynamic which prevents morphomic classes from collapsing together entirely, i.e., undergoing complete levelling. As we probe alternative models, we reveal the limits of conditional entropy as a measure for predictability in systems that are undergoing change. Finally, we demonstrate that autonomous morphology, far from being `unnatural' (e.g. \citealt{Aronoff1994}), is rather the natural (emergent) consequence of a natural (rational) process of inference applied to inflectional systems.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (11 more...)
Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language
Alnajjar, Khalid, Hämäläinen, Mika, Rueter, Jack
This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in endangered NLP.
- Europe > Finland > Uusimaa > Helsinki (0.05)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (3 more...)
Unsupervised Learning by Program Synthesis
We introduce an unsupervised learning algorithm that combines probabilistic modeling with solver-based techniques for program synthesis. We apply our techniques to both a visual learning domain and a language learning problem, showing that our algorithm can learn many visual concepts from only a few examples and that it can recover some English inflectional morphology. Taken together, these results give both a new approach to unsupervised learning of symbolic compositional structures, and a technique for applying program synthesis tools to noisy data.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.82)
Constrained Decoding for Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars
Melcer, Daniel, Fulton, Nathan, Gouda, Sanjay Krishna, Qian, Haifeng
Large Language Models are powerful tools for program synthesis and advanced auto-completion, but come with no guarantee that their output code is syntactically correct. This paper contributes an incremental parser that allows early rejection of syntactically incorrect code, as well as efficient detection of complete programs for fill-in-the-middle (FItM) tasks. We develop Earley-style parsers that operate over left and right quotients of arbitrary context-free grammars, and we extend our incremental parsing and quotient operations to several context-sensitive features present in the grammars of many common programming languages. The result of these contributions is an efficient, general, and well-grounded method for left and right quotient parsing. To validate our theoretical contributions -- and the practical effectiveness of certain design decisions -- we evaluate our method on the particularly difficult case of FItM completion for Python 3. Our results demonstrate that constrained generation can significantly reduce the incidence of syntax errors in recommended code.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (4 more...)
Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals
Khairallah, Christian, Marzouk, Reham, Khalifa, Salam, Nassar, Mayar, Habash, Nizar
Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals' intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
- Europe > Czechia > Prague (0.04)
- (15 more...)