word frequency
Zipfian Whitening
The word embedding space in neural models is skewed, and correcting this can improve task performance.We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are; in reality, word frequencies follow a highly non-uniform distribution, known as .Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines.From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures.By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective (Oyama et al., EMNLP 2023), and in terms of the loss functions for imbalanced classification (Menon et al.
Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity
Tokareva, Anastasiia, Dineley, Judith, Firth, Zoe, Conde, Pauline, Matcham, Faith, Siddi, Sara, Lamers, Femke, Carr, Ewan, Oetzmann, Carolin, Leightley, Daniel, Zhang, Yuezhou, Folarin, Amos A., Haro, Josep Maria, Penninx, Brenda W. J. H., Bailon, Raquel, Vairavan, Srinivasan, Wykes, Til, Dobson, Richard J. B., Narayan, Vaibhav A., Hotopf, Matthew, Cummins, Nicholas, Consortium, The RADAR-CNS
Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity with linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four regressor ML models. Results: In English data, MDD symptom severity was associated with 7 features including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may have also limited the effect sizes observable. A lack of NLP tools in languages other than English restricted our feature choice. Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages using improved protocols, and ML models that account for within- and between-individual variations in language.
Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
Woloszyn, Hanna, Gagl, Benjamin
The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children's descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompt increased similarities between children and LLM text to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
Agent-based visualization of streaming text
Benson, Jordan Riley, Crist, David, Lafleur, Phil, Watson, Benjamin
We present a visualization infrastructure that maps data elements to agents, which have behaviors parameterized by those elements. Dynamic visualizations emerge as the agents change position, alter appearance and respond to one other. Agents move to minimize the difference between displayed agent-to-agent distances, and an input matrix of ideal distances. Our current application is visualization of streaming text. Each agent represents a significant word, visualizing it by displaying the word itself, centered in a circle sized by the frequency of word occurrence. We derive the ideal distance matrix from word cooccurrence, mapping higher co-occurrence to lower distance. To depict co-occurrence in its textual context, the ratio of intersection to circle area approximates the ratio of word co-occurrence to frequency. A networked backend process gathers articles from news feeds, blogs, Digg or Twitter, exploiting online search APIs to focus on user-chosen topics. Resulting visuals reveal the primary topics in text streams as clusters, with agent-based layout moving without instability as data streams change dynamically.
Zipfian Whitening
The word embedding space in neural models is skewed, and correcting this can improve task performance.We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines.From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures.By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective (Oyama et al., EMNLP 2023), and in terms of the loss functions for imbalanced classification (Menon et al.
Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Qiu, Mengyang, Brisebois, Zoe, Sun, Siena
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.
The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect
Iaia, Cosimo, Choksi, Bhavin, Wiebers, Emily, Roig, Gemma, Fiebach, Christian J.
The nouns of our language refer to either concrete entities (like a table) or abstract concepts (like justice or love), and cognitive psychology has established that concreteness influences how words are processed. Accordingly, understanding how concreteness is represented in our mind and brain is a central question in psychology, neuroscience, and computational linguistics. While the advent of powerful language models has allowed for quantitative inquiries into the nature of semantic representations, it remains largely underexplored how they represent concreteness. Here, we used behavioral judgments to estimate semantic distances implicitly used by humans, for a set of carefully selected abstract and concrete nouns. Using Representational Similarity Analysis, we find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. We also find that both representational spaces are implicitly aligned to an explicit representation of concreteness, which was obtained from our participants using an additional concreteness rating task. Importantly, using ablation experiments, we demonstrate that the human-to-model alignment is substantially driven by concreteness, but not by other important word characteristics established in psycholinguistics. These results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.
To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels
Gerrits, Kyo, Guerberof-Arenas, Ana
This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this effect is highest for HT and lowest for MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research. All the code and data are available at https://github.com/INCREC/Pilot_to_MT_or_not_to_MT