OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Bommarito, Michael J. II

arXiv.org Artificial Intelligence

We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes are connected by 9.1M semantic edges and accompanied by 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedic entries, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
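
The abstract does not publish the pipeline code, but the "schema-validated LLM outputs" step can be sketched minimally: a candidate generation is accepted only if it parses and conforms to a declared schema, and is otherwise regenerated. All field names below ("lemma", "pos", "senses") are illustrative placeholders, not OpenGloss's actual schema.

```python
# Minimal sketch of schema-gated generation: accept an LLM output only if
# it parses as JSON and every declared field is present with the expected
# type. Field names are hypothetical, not OpenGloss's real schema.
import json

SENSE_ENTRY_SCHEMA = {
    "lemma": str,   # headword
    "pos": str,     # part of speech
    "senses": list, # list of sense-definition strings
}

def validate_entry(raw_json: str, schema: dict) -> bool:
    """Return True iff raw_json parses and matches the schema; in a
    pipeline, a False result would trigger regeneration."""
    try:
        obj = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    return all(
        isinstance(obj.get(field), expected)
        for field, expected in schema.items()
    )

good = '{"lemma": "bank", "pos": "noun", "senses": ["a financial institution", "a river edge"]}'
bad = '{"lemma": "bank", "pos": "noun"}'         # missing "senses"
print(validate_entry(good, SENSE_ENTRY_SCHEMA))  # True
print(validate_entry(bad, SENSE_ENTRY_SCHEMA))   # False
```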


The aftermath of compounds: Investigating Compounds and their Semantic Representations

Joshi, Swarang

arXiv.org Artificial Intelligence

This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearman's correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
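
The core evaluation step, rank-correlating embedding-derived scores against human ratings, can be sketched with a pure-Python Spearman's rho (Pearson correlation of average ranks). The rating values below are invented toy data, not values from the paper's dataset.

```python
# Spearman's rho between toy human ratings and toy model scores,
# implemented from scratch: rank both vectors (midranks for ties),
# then take the Pearson correlation of the rank vectors.

def ranks(xs):
    """Average (1-based) ranks, with midranks for tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # midrank of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human_st = [6.1, 2.3, 5.5, 1.8, 4.0]       # toy semantic-transparency ratings
model_st = [0.82, 0.15, 0.70, 0.22, 0.55]  # toy embedding-derived scores
print(round(spearman(human_st, model_st), 3))  # 0.9
```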


Unsupervised Learning by Program Synthesis

Kevin Ellis, Armando Solar-Lezama, Josh Tenenbaum

Neural Information Processing Systems

We introduce an unsupervised learning algorithm that combines probabilistic modeling with solver-based techniques for program synthesis. We apply our techniques to both a visual learning domain and a language learning problem, showing that our algorithm can learn many visual concepts from only a few examples and that it can recover some English inflectional morphology. Taken together, these results give both a new approach to unsupervised learning of symbolic compositional structures, and a technique for applying program synthesis tools to noisy data.


Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

Sennrich, Kilian, Ahmadi, Sina

arXiv.org Artificial Intelligence

Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
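
The template-based dataset construction described above can be sketched as slot filling over paired utterance and query templates. The wording, the OntoLex/Wikibase property usage, and the grammatical-feature item below are illustrative stand-ins, not the paper's actual templates.

```python
# Sketch of template-based pair generation: one NL template and one
# SPARQL template share a slot, so filling the slot yields an
# (utterance, query) training pair. Vocabulary usage is illustrative.

NL_TEMPLATE = "What is the plural form of the noun '{lemma}'?"
SPARQL_TEMPLATE = """SELECT ?form WHERE {{
  ?lexeme wikibase:lemma "{lemma}"@en ;
          ontolex:lexicalForm ?f .
  ?f ontolex:representation ?form ;
     wikibase:grammaticalFeature wd:Q146786 .
}}"""

def instantiate(lemma: str) -> tuple[str, str]:
    """Fill the shared slot to produce one (utterance, query) pair."""
    return NL_TEMPLATE.format(lemma=lemma), SPARQL_TEMPLATE.format(lemma=lemma)

utterance, query = instantiate("mouse")
print(utterance)  # What is the plural form of the noun 'mouse'?
```

Scaling this over many templates and a large slot vocabulary is what yields a corpus of NL-to-SPARQL mappings on the order described in the abstract.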


Using Language Models to Disambiguate Lexical Choices in Translation

Barua, Josh, Subramanian, Sanjay, Yin, Kayo, Suhr, Alane

arXiv.org Artificial Intelligence

In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving between 67% and 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.


The natural stability of autonomous morphology

Round, Erich, Esher, Louise, Beniamine, Sacha

arXiv.org Artificial Intelligence

Autonomous morphology, such as inflection class systems and paradigmatic distribution patterns, is widespread and diachronically resilient in natural language. Why this should be so has remained unclear given that autonomous morphology imposes learning costs, offers no clear benefit relative to its absence and could easily be removed by the analogical forces which are constantly reshaping it. Here we propose an explanation for the resilience of autonomous morphology, in terms of a diachronic dynamic of attraction and repulsion between morphomic categories, which emerges spontaneously from a simple paradigm cell filling process. Employing computational evolutionary models, our key innovation is to bring to light the role of 'dissociative evidence', i.e., evidence for inflectional distinctiveness which a rational reasoner will have access to during analogical inference. Dissociative evidence creates a repulsion dynamic which prevents morphomic classes from collapsing together entirely, i.e., undergoing complete levelling. As we probe alternative models, we reveal the limits of conditional entropy as a measure for predictability in systems that are undergoing change. Finally, we demonstrate that autonomous morphology, far from being 'unnatural' (e.g. Aronoff 1994), is rather the natural (emergent) consequence of a natural (rational) process of inference applied to inflectional systems.
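
A single step of the paradigm cell filling process can be illustrated with a bare analogical baseline: to guess an unseen cell of a lexeme, look at lexemes that agree on the known cells and copy their majority pattern. This toy is only the inference primitive; it omits the paper's evolutionary dynamics and dissociative evidence entirely, and the Latin-like endings are invented.

```python
# Toy paradigm-cell filling by analogy: predict an unseen cell from
# lexemes whose attested cells match. Data and endings are invented;
# this is not the paper's evolutionary model.
from collections import Counter

# cell -> ending, per lexeme (toy, Latin-like declension data)
paradigms = {
    "lex1": {"nom.sg": "-a", "gen.sg": "-ae"},
    "lex2": {"nom.sg": "-a", "gen.sg": "-ae"},
    "lex3": {"nom.sg": "-us", "gen.sg": "-i"},
}

def fill_cell(known: dict, target_cell: str) -> str:
    """Predict target_cell by majority vote among lexemes that agree
    with `known` on all shared cells."""
    votes = Counter()
    for cells in paradigms.values():
        if target_cell in cells and all(
            cells.get(c) == v for c, v in known.items() if c in cells
        ):
            votes[cells[target_cell]] += 1
    return votes.most_common(1)[0][0]

# A new lexeme attested only as nom.sg "-a": analogy predicts gen.sg "-ae"
print(fill_cell({"nom.sg": "-a"}, "gen.sg"))  # -ae
```

Iterating this kind of inference over a population of speakers and generations is what, per the abstract, spontaneously produces the attraction-repulsion dynamic between morphomic classes.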


Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language

Alnajjar, Khalid, Hämäläinen, Mika, Rueter, Jack

arXiv.org Artificial Intelligence

This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in NLP for endangered languages.


Constrained Decoding for Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars

Melcer, Daniel, Fulton, Nathan, Gouda, Sanjay Krishna, Qian, Haifeng

arXiv.org Artificial Intelligence

Large Language Models are powerful tools for program synthesis and advanced auto-completion, but come with no guarantee that their output code is syntactically correct. This paper contributes an incremental parser that allows early rejection of syntactically incorrect code, as well as efficient detection of complete programs for fill-in-the-middle (FItM) tasks. We develop Earley-style parsers that operate over left and right quotients of arbitrary context-free grammars, and we extend our incremental parsing and quotient operations to several context-sensitive features present in the grammars of many common programming languages. The result of these contributions is an efficient, general, and well-grounded method for left and right quotient parsing. To validate our theoretical contributions -- and the practical effectiveness of certain design decisions -- we evaluate our method on the particularly difficult case of FItM completion for Python 3. Our results demonstrate that constrained generation can significantly reduce the incidence of syntax errors in recommended code.
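
The paper's Earley-style quotient parsers are substantial machinery, but the decoding-loop idea they serve can be sketched simply: before accepting a candidate token, check that the extended prefix is still viable, and prune it otherwise. Below, a bracket-balance check stands in for a real incremental parser; it captures early rejection but none of the grammar-quotienting contributions.

```python
# Sketch of constrained decoding via prefix viability: a candidate token
# survives only if appending it leaves the code prefix completable.
# A bracket-balance check substitutes for a real incremental parser.

PAIRS = {")": "(", "]": "[", "}": "{"}

def viable_prefix(code: str) -> bool:
    """True if the brackets seen so far could still be completed."""
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack[-1] != PAIRS[ch]:
                return False  # early rejection: no suffix can repair this
            stack.pop()
    return True  # unclosed brackets are fine in a prefix

def constrain(prefix: str, candidates: list[str]) -> list[str]:
    """Keep only candidate tokens that preserve prefix viability."""
    return [t for t in candidates if viable_prefix(prefix + t)]

print(constrain("f(x[0]", [")", "]", " + 1"]))  # [')', ' + 1']
```

In the paper's setting the viability test is an incremental parse state over a quotiented grammar rather than a bracket stack, which is what makes the approach sound for full programming languages and fill-in-the-middle completion.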


Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals

Khairallah, Christian, Marzouk, Reham, Khalifa, Salam, Nassar, Mayar, Habash, Nizar

arXiv.org Artificial Intelligence

Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals' intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.