word length


Random Text, Zipf's Law, Critical Length, and Implications for Large Language Models

Berman, Vladimir

arXiv.org Machine Learning

We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.
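As an illustrative check (not code from the paper), the symbol-level model is easy to simulate: draw i.i.d. symbols from a letter alphabet plus a space, segment into maximal non-space blocks, and inspect word-length and rank-frequency statistics. The alphabet size and space probability below are arbitrary choices.

```python
import random
import collections

random.seed(0)
alphabet = "abcde"   # illustrative alphabet of V = 5 letters
p_space = 0.2        # illustrative probability of the space symbol
n = 200_000          # number of i.i.d. symbol draws

# Draw the symbol stream: space with probability p_space, else a uniform letter.
chars = [
    " " if random.random() < p_space else random.choice(alphabet)
    for _ in range(n)
]

# A word is a maximal block of non-space symbols.
words = "".join(chars).split()

# Word lengths should be geometric: P(len = k) = (1 - p_space)^(k-1) * p_space,
# so counts at successive lengths should decay by a factor of (1 - p_space).
lengths = collections.Counter(len(w) for w in words)

# Rank-frequency structure: sort word types by frequency, highest first.
freqs = sorted(collections.Counter(words).values(), reverse=True)
```

Plotting `freqs` against rank on log-log axes then exhibits the Zipf-type power-law decay described in the abstract.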


Self-Organizing Language

Eugenio, P. Myles, Beavers, Anthony

arXiv.org Artificial Intelligence

We introduce a novel paradigm of emergent local memory: a continuous-learning, fully parallel, content-addressable memory that encodes global order. It demonstrates how local constraints on uncoordinated learning can produce topologically protected memories realizing emergent symbolic order, and it therefore serves as a neuro-symbolic bridge. It can further produce human language without data by exploiting its own self-organizing dynamics. It teaches us that words arise as a side effect of emergent symbolic order, and that human language patterns at all structural levels reflect a universal mechanism of word formation (which is subregular). This work addresses essential questions about the existence and origin of human language data.


Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia

Rydel-Johnston, Hugo, Kafkas, Alex

arXiv.org Artificial Intelligence

Division of Psychology, Communication & Human Neuroscience, The University of Manchester. Author Note: Hugo Rydel-Johnston, https://orcid.org/0009-0006-1103-1015; Alex Kafkas, https://orcid.org/0000-0001-5133-8827. We have no conflicts of interest to disclose. Correspondence concerning this article should be addressed to Hugo Rydel-Johnston, Division of Psychology, Communication & Human Neuroscience, The University of Manchester, Oxford Road, Manchester, M13 9PL, UK.

Abstract: We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level properties -- word length, frequency, and predictability -- we model the influence of each of these features on dyslexic time costs. We find that all three properties robustly change reading times in both typical and dyslexic readers, but dyslexic readers show stronger sensitivities to each of the three features, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap -- by about one-third -- with predictability showing the strongest effect, followed by length and frequency. These patterns align with existing dyslexia theories suggesting heightened demands on linguistic working memory and phonological encoding in dyslexic reading, and directly motivate further research into lexical complexity and preview benefits to further explain the quantified gap. In effect, these findings break down when extra dyslexic costs arise and how large they are, and provide actionable guidance for the development of interventions and computational models for dyslexic readers.

Keywords: eye movements, reading time, word length, lexical frequency, predictability, skipping, total reading time

Why Dyslexic Reading Takes Longer - And When
Dyslexia is characterized by persistent difficulty in accurate and/or fluent word recognition and decoding (Lyon et al., 2003) and affects between 4-8% of individuals (Yang et al., 2022; Doust et al., 2022).


An experimental and computational study of an Estonian single-person word naming

Lõo, Kaidi, Tavast, Arvi, Heitmeier, Maria, Baayen, Harald

arXiv.org Artificial Intelligence

This study investigates lexical processing in Estonian. A large-scale single-subject experiment is reported that combines the word naming task with eye-tracking. Five response variables (first fixation duration, total fixation duration, number of fixations, word naming latency, and spoken word duration) are analyzed with the generalized additive model. Of central interest is the question of whether measures for lexical processing generated by a computational model of the mental lexicon (the Discriminative Lexicon Model, DLM) are predictive for these response variables, and how they compare to classical predictors such as word frequency, neighborhood size, and inflectional paradigm size. Computational models were implemented both with linear and deep mappings. Central findings are, first, that DLM-based measures are powerful predictors for lexical processing, second, that DLM-measures using deep learning are not necessarily more precise predictors of lexical processing than DLM-measures using linear mappings, third, that classical predictors tend to provide somewhat more precise fits compared to DLM-based predictors (except for total fixation duration, where the two provide equivalent goodness of fit), and fourth, that in the naming task lexical variables are not predictive for first fixation duration and the total number of fixations. As the DLM works with mappings from form to meaning, the predictivity of DLM-based measures for total fixation duration, naming latencies, and spoken word duration indicates that meaning is heavily involved in the present word naming task.
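The linear-mapping variant described above can be sketched in toy form: form cues and semantic vectors are related by a single linear transform estimated by least squares. The matrices, dimensions, and names below are illustrative assumptions, not the study's actual implementation or data.

```python
import numpy as np

# Toy sketch of a linear form-to-meaning mapping in the spirit of the DLM.
# C holds binary form cues (e.g., letter n-gram indicators) per word;
# S holds semantic vectors per word. All sizes are arbitrary for illustration.
rng = np.random.default_rng(0)
n_words, n_cues, n_dims = 5, 8, 4
C = rng.integers(0, 2, size=(n_words, n_cues)).astype(float)
S = rng.normal(size=(n_words, n_dims))

# Estimate the linear mapping F (form -> meaning) by least squares.
F, *_ = np.linalg.lstsq(C, S, rcond=None)

# Predicted semantic vectors; row-wise distances between S_hat and S
# are the kind of model-derived measure used as a predictor in such studies.
S_hat = C @ F
```

A deep-mapping variant would replace `F` with a small feed-forward network; the abstract's finding is that this does not necessarily yield more precise predictors than the linear mapping.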


CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Uzan, Omri, Pinter, Yuval

arXiv.org Artificial Intelligence

Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with average accuracies of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.


The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect

Iaia, Cosimo, Choksi, Bhavin, Wiebers, Emily, Roig, Gemma, Fiebach, Christian J.

arXiv.org Artificial Intelligence

The nouns of our language refer to either concrete entities (like a table) or abstract concepts (like justice or love), and cognitive psychology has established that concreteness influences how words are processed. Accordingly, understanding how concreteness is represented in our mind and brain is a central question in psychology, neuroscience, and computational linguistics. While the advent of powerful language models has allowed for quantitative inquiries into the nature of semantic representations, it remains largely underexplored how they represent concreteness. Here, we used behavioral judgments to estimate semantic distances implicitly used by humans, for a set of carefully selected abstract and concrete nouns. Using Representational Similarity Analysis, we find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. We also find that both representational spaces are implicitly aligned to an explicit representation of concreteness, which was obtained from our participants using an additional concreteness rating task. Importantly, using ablation experiments, we demonstrate that the human-to-model alignment is substantially driven by concreteness, but not by other important word characteristics established in psycholinguistics. These results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.


Numerical Words and Linguistic Loops: The Perpetual Four-Letter Routine

Polavaram, Krishna Chaitanya

arXiv.org Artificial Intelligence

This study presents a fascinating linguistic property related to the number of letters in words and their corresponding numerical values. By selecting any arbitrary word, counting its constituent letters, and subsequently spelling out the resulting count and tallying the letters anew, an unanticipated pattern is observed. Remarkably, this iterative sequence, conducted on a dataset of 100,000 random words, invariably converges to the numeral four (4), termed the "Linguistic Loop (LL) constant". Examining 73 languages utilizing the Latin alphabet, this research reveals distinctive patterns. Among them, 28 languages exhibit LL-positive behavior adhering to the established property, while 31 languages deviate as LL-negative. Additionally, 13 languages display nuanced tendencies: eight feature two LL constants (bi-positivity), and five feature three constants (tri-positivity). This discovery highlights a linguistic quirk within Latin alphabet-based language number-word representations, uncovering an intriguing facet across diverse alphabetic systems. It also raises questions about the underlying linguistic and cognitive mechanisms responsible for this phenomenon.
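For English (an LL-positive language on this account), the routine is easy to state in code. The number-word table below is a small illustrative fragment covering words of up to 13 letters; the function name is an assumption for illustration.

```python
# Sketch of the iterative letter-count routine: count a word's letters,
# spell out the count, count again, and repeat until a fixed point.
NUMBER_WORDS = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
    6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
    11: "eleven", 12: "twelve", 13: "thirteen",
}

def linguistic_loop(word, max_steps=20):
    """Iterate word -> letter count -> spelled-out count until a word
    spells its own length (a fixed point of the routine)."""
    for _ in range(max_steps):
        n = len(word)
        if NUMBER_WORDS.get(n) == word:
            return n
        word = NUMBER_WORDS[n]
    return None

print(linguistic_loop("linguistics"))  # 11 -> "eleven" -> "six" -> "three" -> "five" -> "four" -> 4
```

In English the only fixed point is "four", which has exactly four letters, so every starting word drains into it; the paper's LL-negative and bi-/tri-positive languages are those whose number-word lengths create no such fixed point, or more than one.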


Learning the symmetric group: large from small

Petschack, Max, Garbali, Alexandr, de Gier, Jan

arXiv.org Artificial Intelligence

Machine learning explorations can make significant inroads into solving difficult problems in pure mathematics. One advantage of this approach is that mathematical datasets do not suffer from noise, but a challenge is the amount of data required to train these models and that this data can be computationally expensive to generate. Further key challenges include the difficulty of a posteriori interpretation of statistical models and the implementation of deep and abstract mathematical problems. We propose a method for scalable tasks, by which models trained on simpler versions of a task can then generalize to the full task. Specifically, we demonstrate that a transformer neural network trained on predicting permutations from words formed by general transpositions in the symmetric group $S_{10}$ can generalize to the symmetric group $S_{25}$ with near 100\% accuracy. We also show that $S_{10}$ generalizes to $S_{16}$ with similar performance if we only use adjacent transpositions. We employ identity augmentation as a key tool to manage variable word lengths, and partitioned windows for training on adjacent transpositions. Finally, we compare variations of the method used and discuss potential challenges with extending the method to other tasks.
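The prediction target can be illustrated with a toy routine that composes a word of transpositions into the permutation it represents. The left-to-right application order, 0-based indexing, and function name here are assumptions for illustration, not the paper's conventions.

```python
def apply_transpositions(n, word):
    """Compose a word of transpositions in S_n into a permutation.

    Each letter of the word is a pair (i, j) of 0-based positions;
    pairs are applied left to right to the identity permutation.
    """
    perm = list(range(n))
    for i, j in word:
        perm[i], perm[j] = perm[j], perm[i]
    return perm

# In S_5: applying (0 1) and then (1 2) to the identity.
print(apply_transpositions(5, [(0, 1), (1, 2)]))  # -> [1, 2, 0, 3, 4]
```

A model in this setting reads the word of transpositions as input and must output the resulting permutation; restricting `word` to pairs of the form (i, i+1) gives the adjacent-transposition variant.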


Language Models Largely Exhibit Human-like Constituent Ordering Preferences

Tur, Ada Defne, Kamath, Gaurav, Reddy, Siva

arXiv.org Artificial Intelligence

Though English sentences are typically inflexible vis-à-vis word order, constituents often show far more variability in ordering. One prominent theory presents the notion that constituent ordering is directly correlated with constituent weight: a measure of the constituent's length or complexity. Such theories are interesting in the context of natural language processing (NLP), because while recent advances in NLP have led to significant gains in the performance of large language models (LLMs), much remains unclear about how these models process language, and how this compares to human language processing. In particular, it remains an open question whether LLMs display the same patterns with constituent movement; answering it may provide insights into existing theories on when and how such shifts occur in human language. We compare a variety of LLMs with diverse properties to evaluate broad LLM performance on four types of constituent movement: heavy NP shift, particle movement, dative alternation, and multiple PPs. Despite performing unexpectedly around particle movement, LLMs generally align with human preferences around constituent ordering.


Disentanglement and Compositionality of Letter Identity and Letter Position in Variational Auto-Encoder Vision Models

Bianchi, Bruno, Agrawal, Aakash, Dehaene, Stanislas, Chemla, Emmanuel, Lakretz, Yair

arXiv.org Artificial Intelligence

Human readers can accurately count how many letters are in a word (e.g., 7 in ``buffalo''), remove a letter from a given position (e.g., ``bufflo''), or add a new one. The brains of human readers must therefore have learned to disentangle information related to the position of a letter from its identity. Such disentanglement is necessary for the compositional, unbounded ability of humans to create and parse new strings, with any combination of letters appearing in any position. Do modern deep neural models also possess this crucial compositional ability? Here, we tested whether neural models that achieve state-of-the-art performance on disentanglement of features in visual input can also disentangle letter position and letter identity when trained on images of written words. Specifically, we trained beta variational autoencoders ($\beta$-VAE) to reconstruct images of letter strings and evaluated their disentanglement performance using CompOrth -- a new benchmark that we created for studying compositional learning and zero-shot generalization in visual models for orthography. The benchmark provides a set of tests, of increasing complexity, to evaluate the degree of disentanglement between orthographic features of written words in deep neural models. Using CompOrth, we conducted a set of experiments to analyze the generalization ability of these models, in particular to unseen word lengths and to unseen combinations of letter identities and letter positions. We found that while the models effectively disentangle surface features, such as horizontal and vertical `retinal' locations of words within an image, they dramatically fail to disentangle letter position and letter identity, and lack any notion of word length. Together, this study demonstrates the shortcomings of state-of-the-art $\beta$-VAE models compared to humans and proposes a new challenge and a corresponding benchmark for evaluating neural models.