 polysemy



FIRE: Semantic Field of Words Represented as Non-Linear Functions

Neural Information Processing Systems

State-of-the-art word embeddings presume a linear vector space, but this approach does not easily incorporate the nonlinearity that is necessary to represent polysemy. We thus propose a novel semantic FIeld REpresentation, called FIRE, which is a $D$-dimensional field in which every word is represented as a set of its locations and a nonlinear function covering the field. The strength of a word's relation to another word at a certain location is measured as the function value at that location. With FIRE, compositionality is represented via functional additivity, whereas polysemy is represented via the set of points and the function's multimodality. By implementing FIRE for English and comparing it with previous representation methods via word and sentence similarity tasks, we show that FIRE produces comparable or even better results. In an evaluation of polysemy to predict the number of word senses, FIRE greatly outperformed BERT and Word2vec, providing evidence of how FIRE represents polysemy. The code is available at https://github.com/kduxin/firelang.
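To make the idea in this abstract concrete, here is a minimal toy sketch of a word stored as a set of locations plus a multimodal function over the field. It is not the firelang API: the class `FieldWord`, the Gaussian-bump form of the function, and the helpers `relation` and `compose` are all assumptions chosen for illustration, and the coordinates are invented.

```python
import math

# Hypothetical toy model; none of these names come from the firelang code base.
class FieldWord:
    def __init__(self, locations, centers, width=1.0):
        self.locations = locations  # points in the field where the word sits
        self.centers = centers      # modes of the word's nonlinear function
        self.width = width

    def f(self, point):
        """Multimodal function: a sum of Gaussian bumps, one per center."""
        total = 0.0
        for center in self.centers:
            sq = sum((p - c) ** 2 for p, c in zip(point, center))
            total += math.exp(-sq / (2 * self.width ** 2))
        return total


def relation(a, b):
    """Strength of a's relation to b: a's function averaged over b's locations."""
    return sum(a.f(s) for s in b.locations) / len(b.locations)


def compose(a, b):
    """Functional additivity: the composed function is f_a + f_b (equal widths assumed)."""
    return FieldWord(a.locations + b.locations, a.centers + b.centers, a.width)


# Invented 2-D coordinates: "bank" has two modes, one near each sense.
bank = FieldWord(locations=[(0.0, 0.0), (5.0, 5.0)], centers=[(0.0, 0.0), (5.0, 5.0)])
money = FieldWord(locations=[(0.0, 0.5)], centers=[(0.0, 0.5)])
river = FieldWord(locations=[(5.0, 4.5)], centers=[(5.0, 4.5)])

print(relation(bank, money), relation(bank, river), relation(money, river))
```

In this toy setup the two-mode function lets "bank" relate strongly to both "money" and "river", while the single-mode "money" relates to "river" only negligibly. That is the kind of behavior a single point in a linear space cannot express.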


Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

Kugler, Kai

arXiv.org Artificial Intelligence

We present the first systematic investigation of Martin's Law, the empirical relationship between word frequency and polysemy, in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin's Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint $10^4$, then degrades by checkpoint $10^5$. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r $\approx$ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.
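The measurement this abstract describes can be sketched in two steps: count DBSCAN clusters among a word's contextualized embeddings as a proxy for its number of senses, then correlate sense counts with word frequency. The code below is a minimal pure-Python illustration, not the paper's pipeline. The function names, the `eps` and `min_pts` values, and the toy 2-D points are assumptions, and a real study would more likely use a library implementation on high-dimensional embeddings.

```python
import math


def dbscan_cluster_count(points, eps, min_pts):
    """Count DBSCAN clusters (noise excluded) using a naive O(n^2) neighbor search."""
    n = len(points)
    # Each point counts as its own neighbor, as in common DBSCAN implementations.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    label = [None] * n
    clusters = 0
    for i in range(n):
        if label[i] is not None or not is_core[i]:
            continue
        # Grow a new cluster outward from this unvisited core point.
        label[i] = clusters
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if label[q] is None:
                    label[q] = clusters
                    stack.append(q)
        clusters += 1
    return clusters


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy "embeddings" of one word: two tight usage groups plus one outlier.
occurrences = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)]
print(dbscan_cluster_count(occurrences, eps=0.5, min_pts=3))
```

Given one sense count per word, `pearson(frequencies, sense_counts)` would then produce the kind of r value the abstract reports. Whether frequency is log-transformed first is not stated in the abstract and is left open here.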


Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals

Wu, Jingni, Zeldes, Amir

arXiv.org Artificial Intelligence

Discourse markers (DMs) like 'but' or 'then' are crucial for creating coherence in discourse, yet they are often replaced by or co-occur with non-DMs ('in the morning' can mean the same as 'then'), and both can be ambiguous ('since' can refer to time or cause). The interaction mechanism between such signals remains unclear but pivotal for their disambiguation. In this paper we investigate the relationship between DM polysemy and co-occurrence of non-DM signals in English, as well as the influence of genre on these patterns. Using the framework of eRST, we propose a graded definition of DM polysemy, and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals. Our findings reveal that while polysemous DMs do co-occur with more diverse non-DMs, the total number of co-occurring signals does not necessarily increase. Moreover, genre plays a significant role in shaping DM-signal interactions.



Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces

Pichat, Michael, Pogrund, William, Pichat, Paloma, Poumay, Judicael, Gasparian, Armanouche, Demarchi, Samuel, Corbet, Martin, Georgeon, Alois, Veillet-Guillem, Michael

arXiv.org Artificial Intelligence

The polysemantic nature of synthetic neurons in artificial intelligence language models is currently understood as the result of a necessary superposition of distributed features within the latent space. We propose an alternative approach, geometrically defining a neuron in layer n as a categorical vector space with a non-orthogonal basis, composed of categorical sub-dimensions extracted from preceding neurons in layer n-1. This categorical vector space is structured by the activation space of each neuron and enables, via an intra-neuronal attention process, the identification and utilization of a critical categorical zone for the efficiency of the language model - more homogeneous and located at the intersection of these different categorical sub-dimensions.


The cell as a token: high-dimensional geometry in language models and cell embeddings

Gilpin, William

arXiv.org Artificial Intelligence

The implicit goal of modern single-cell technologies is to decompile the cell: to abstract it away from its squishy context, and to render it as a single point in a high-dimensional vector space. But how do we know if this space is meaningful? The embedding process mirrors parallel developments in machine learning, where large language models ingest unstructured text by converting words into discrete tokens embedded within a high-dimensional vector space. This perspective explores how advances in understanding the structure of language embeddings can inform ongoing efforts to analyze and visualize single cell datasets. We discuss how the context of tokens influences the geometry of embedding space, and the role of low-dimensional manifolds in shaping this space's robustness and interpretability. We highlight new developments in language modeling, such as interpretability probes and in-context reasoning, that can inform future efforts to construct and consolidate cell atlases.


Tracing the Development of the Virtual Particle Concept Using Semantic Change Detection

Zichert, Michael, Wüthrich, Adrian

arXiv.org Artificial Intelligence

Virtual particles are peculiar objects. They figure prominently in much of theoretical and experimental research in elementary particle physics. But exactly what they are is far from obvious. In particular, to what extent they should be considered "real" remains a matter of controversy in philosophy of science. Their origin and development have also only recently come into the focus of scholarship in the history of science. In this study, we propose using the intriguing case of virtual particles to discuss the efficacy of Semantic Change Detection (SCD) based on contextualized word embeddings from a domain-adapted BERT model in studying specific scientific concepts. We find that the SCD metrics align well with qualitative research insights in the history and philosophy of science, as well as with the results obtained from Dependency Parsing to determine the frequency and connotations of the term "virtual". Still, the metrics of SCD provide additional insights over and above the qualitative research and the Dependency Parsing. Among other things, the metrics suggest that the concept of the virtual particle became more stable after 1950 but at the same time also more polysemous.
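The abstract does not say which SCD metrics were used, so the following is only a sketch of two measures commonly applied to contextualized embeddings. The first is the cosine distance between per-period mean vectors, read as a proxy for how much a term shifted between periods. The second is the average pairwise distance within one period, read as a proxy for how spread out, or polysemous, its usages are. All function names and the toy 2-D vectors are assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def prototype_shift(period_a, period_b):
    """Distance between the mean usage vectors of two time periods."""
    return cosine_distance(mean_vector(period_a), mean_vector(period_b))


def average_pairwise_distance(vectors):
    """Mean cosine distance over all unordered pairs within one period."""
    pairs = [
        (i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))
    ]
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


# Invented toy usage vectors for one term in two periods.
early = [[1.0, 0.0], [0.9, 0.1]]
late = [[1.0, 0.0], [0.0, 1.0]]

print(prototype_shift(early, late))
print(average_pairwise_distance(early), average_pairwise_distance(late))
```

In this toy data the later period has a larger within-period spread than the earlier one. A finding like "more stable but also more polysemous" would correspond to a small shift between consecutive periods combined with a growing within-period spread.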


Locally Measuring Cross-lingual Lexical Alignment: A Domain and Word Level Perspective

Karidi, Taelin, Grossman, Eitan, Abend, Omri

arXiv.org Artificial Intelligence

NLP research on aligning lexical representation spaces to one another has so far focused on aligning language spaces in their entirety. However, cognitive science has long focused on a local perspective, investigating whether translation equivalents truly share the same meaning or the extent to which cultural and regional influences result in meaning variations. With recent technological advances and the increasing amounts of available data, the longstanding question of cross-lingual lexical alignment can now be approached in a more data-driven manner. However, developing metrics for the task requires some methodology for comparing metric efficacy. We address this gap and present a methodology for analyzing both synthetic validations and a novel naturalistic validation using lexical gaps in the kinship domain. We further propose new metrics, hitherto unexplored on this task, based on contextualized embeddings. Our analysis spans 16 diverse languages, demonstrating that there is substantial room for improvement with the use of newer language models. Our research paves the way for more accurate and nuanced cross-lingual lexical alignment methodologies and evaluation.