AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing Systems Feb-18-2026, 09:37:38 GMT

Zipfian Whitening

The word embedding space in neural models is skewed, and correcting this can improve task performance.

artificial intelligence, machine learning, natural language, (21 more...)

Country:

Asia > Japan > Honshū > Tōhoku (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
North America > Dominican Republic (0.04)
(14 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

arXiv.org Artificial Intelligence Nov-11-2025

Multilingual Lexical Feature Analysis of Spoken Language for Predicting Major Depression Symptom Severity

Tokareva, Anastasiia, Dineley, Judith, Firth, Zoe, Conde, Pauline, Matcham, Faith, Siddi, Sara, Lamers, Femke, Carr, Ewan, Oetzmann, Carolin, Leightley, Daniel, Zhang, Yuezhou, Folarin, Amos A., Haro, Josep Maria, Penninx, Brenda W. J. H., Bailon, Raquel, Vairavan, Srinivasan, Wykes, Til, Dobson, Richard J. B., Narayan, Vaibhav A., Hotopf, Matthew, Cummins, Nicholas, Consortium, The RADAR-CNS

Background: Captured between clinical appointments using mobile devices, spoken language has potential for objective, more regular assessment of symptom severity and earlier detection of relapse in major depressive disorder. However, research to date has largely been in non-clinical cross-sectional samples of written language using complex machine learning (ML) approaches with limited interpretability. Methods: We describe an initial exploratory analysis of longitudinal speech data and PHQ-8 assessments from 5,836 recordings of 586 participants in the UK, Netherlands, and Spain, collected in the RADAR-MDD study. We sought to identify interpretable lexical features associated with MDD symptom severity with linear mixed-effects modelling. Interpretable features and high-dimensional vector embeddings were also used to test the prediction performance of four regressor ML models. Results: In English data, MDD symptom severity was associated with 7 features including lexical diversity measures and absolutist language. In Dutch, associations were observed with words per sentence and positive word frequency; no associations were observed in recordings collected in Spain. The predictive power of lexical features and vector embeddings was near chance level across all languages. Limitations: Smaller samples in non-English speech and methodological choices, such as the elicitation prompt, may have also limited the effect sizes observable. A lack of NLP tools in languages other than English restricted our feature choice. Conclusion: To understand the value of lexical markers in clinical research and practice, further research is needed in larger samples across several languages using improved protocols, and ML models that account for within- and between-individual variations in language.
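As a rough illustration of the kind of interpretable lexical features the abstract mentions (words per sentence, lexical diversity, absolutist language), the sketch below computes three of them from a transcript. The function name, the regular expressions, and the word list are assumptions made for this example, not the study's actual pipeline or lexicon, and the tokenizer only handles plain English text.

```python
import re

# Hypothetical, illustrative word list; the study's actual lexicon is not given here.
ABSOLUTIST_WORDS = {"always", "never", "completely", "nothing", "everything", "totally"}


def lexical_features(transcript: str) -> dict:
    """Compute a few simple, interpretable lexical features from a transcript."""
    # Split into sentences on ., ! or ? and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    # Lowercased word tokens (ASCII letters and apostrophes only).
    tokens = re.findall(r"[a-z']+", transcript.lower())
    n_tokens = len(tokens)
    if n_tokens == 0 or not sentences:
        return {"words_per_sentence": 0.0, "type_token_ratio": 0.0, "absolutist_rate": 0.0}
    return {
        # Mean number of word tokens per sentence.
        "words_per_sentence": n_tokens / len(sentences),
        # Type-token ratio: distinct words divided by total words (one lexical diversity measure).
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Share of tokens that appear in the illustrative absolutist word list.
        "absolutist_rate": sum(t in ABSOLUTIST_WORDS for t in tokens) / n_tokens,
    }


features = lexical_features("I always feel tired. Nothing helps. I feel tired again.")
```

Features like these could then serve as predictors in a mixed-effects model with a per-participant random effect, which is the modelling approach the abstract names.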

lexical feature, machine learning, natural language, (18 more...)

2511.07011

Country:

Europe > Spain (1.00)
North America > United States (0.93)
Europe > United Kingdom > England > Greater London > London (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Neural Information Processing Systems Oct-10-2025, 18:46:43 GMT

dd1fef536655685898a6602bfbf16857-Paper-Conference.pdf

frequency, vector, zipfian, (16 more...)

Country:

Asia > Japan > Honshū > Tōhoku (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
North America > Dominican Republic (0.04)
(14 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Woloszyn, Hanna, Gagl, Benjamin

Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study

arXiv.org Artificial Intelligence Aug-20-2025

The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children's descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children's corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompts increased the similarity between children's and LLM texts to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education, while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.

large language model, machine learning, natural language, (17 more...)

2508.13769

Country: Europe (0.46)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Therapeutic Area (0.47)
Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Benson, Jordan Riley, Crist, David, Lafleur, Phil, Watson, Benjamin

Agent-based visualization of streaming text

arXiv.org Artificial Intelligence Jul-15-2025

We present a visualization infrastructure that maps data elements to agents, which have behaviors parameterized by those elements. Dynamic visualizations emerge as the agents change position, alter appearance and respond to one another. Agents move to minimize the difference between displayed agent-to-agent distances and an input matrix of ideal distances. Our current application is visualization of streaming text. Each agent represents a significant word, visualizing it by displaying the word itself, centered in a circle sized by the frequency of word occurrence. We derive the ideal distance matrix from word co-occurrence, mapping higher co-occurrence to lower distance. To depict co-occurrence in its textual context, the ratio of intersection to circle area approximates the ratio of word co-occurrence to frequency. A networked backend process gathers articles from news feeds, blogs, Digg or Twitter, exploiting online search APIs to focus on user-chosen topics. Resulting visuals reveal the primary topics in text streams as clusters, with agent-based layout moving without instability as data streams change dynamically.
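The layout rule described above (agents move to reduce the gap between displayed and ideal pairwise distances, with ideal distances derived from co-occurrence) can be sketched as a simple iterative update. This is a minimal sketch under stated assumptions: the mapping `1 / (1 + count)`, the step size, and all function names are illustrative choices, not the authors' implementation.

```python
import math


def ideal_distances(cooc):
    """Map higher co-occurrence counts to lower ideal distances (assumed 1 / (1 + count))."""
    n = len(cooc)
    return [[0.0 if i == j else 1.0 / (1.0 + cooc[i][j]) for j in range(n)] for i in range(n)]


def stress(pos, ideal):
    """Sum of squared differences between displayed and ideal pairwise distances."""
    total = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            total += (math.dist(pos[i], pos[j]) - ideal[i][j]) ** 2
    return total


def step(pos, ideal, rate=0.05):
    """Move every agent a small step that reduces its distance error to each other agent."""
    new_pos = []
    for i, (xi, yi) in enumerate(pos):
        dx = dy = 0.0
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            d = math.dist((xi, yi), (xj, yj)) or 1e-9  # avoid division by zero
            err = d - ideal[i][j]  # positive: too far apart, so move toward agent j
            dx += err * (xj - xi) / d
            dy += err * (yj - yi) / d
        new_pos.append((xi + rate * dx, yi + rate * dy))
    return new_pos


# Toy symmetric co-occurrence counts for three words (made-up data).
cooc = [[0, 5, 0], [5, 0, 1], [0, 1, 0]]
ideal = ideal_distances(cooc)
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
initial_stress = stress(positions, ideal)
for _ in range(500):
    positions = step(positions, ideal)
final_stress = stress(positions, ideal)
```

Because each update is small and incremental, new co-occurrence counts can be swapped in between steps, which is one plausible way a layout of this kind could follow a changing text stream without abrupt jumps.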

artificial intelligence, information management, visualization, (19 more...)

2507.08884

Country: North America > United States (0.15)

Genre:

Instructional Material > Online (0.62)
Instructional Material > Course Syllabus & Notes (0.62)

Industry: Media > News (0.35)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Neural Information Processing Systems May-27-2025, 19:18:16 GMT

Zipfian Whitening

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective (Oyama et al., EMNLP 2023), and in terms of the loss functions for imbalanced classification (Menon et al.
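The core idea, PCA whitening in which the mean and covariance are weighted by empirical word frequency rather than treated as uniform over the vocabulary, can be sketched in a few lines of NumPy. This is a minimal sketch assuming a `(vocab_size, dim)` embedding matrix and a vector of word counts; the function name, the `eps` stabilizer, and the toy data are assumptions for illustration, and the paper's exact procedure may differ in detail.

```python
import numpy as np


def zipfian_whitening(embeddings, freqs, eps=1e-12):
    """PCA-whiten word embeddings using a frequency-weighted mean and covariance."""
    W = np.asarray(embeddings, dtype=float)     # shape (vocab_size, dim)
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()                             # empirical unigram probabilities
    mu = p @ W                                  # frequency-weighted mean vector
    centered = W - mu
    cov = (centered * p[:, None]).T @ centered  # frequency-weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition of a symmetric matrix
    # Rotate onto the principal axes and rescale each axis to unit weighted variance.
    return centered @ eigvecs / np.sqrt(eigvals + eps)


# Toy data: correlated random vectors with Zipf-like counts (1 / rank).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
counts = 1.0 / np.arange(1, 51)
whitened = zipfian_whitening(vectors, counts)
```

Passing uniform counts to the same function reduces it to ordinary PCA whitening, which makes the contrast between the uniform and Zipfian assumptions easy to test on a given embedding table.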

empirical word frequency, word frequency, zipfian whitening, (1 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Qiu, Mengyang, Brisebois, Zoe, Sun, Siena

Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task

arXiv.org Artificial Intelligence May-23-2025

Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.

large language model, machine learning, natural language, (22 more...)

2505.16164

Country: North America (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Iaia, Cosimo, Choksi, Bhavin, Wiebers, Emily, Roig, Gemma, Fiebach, Christian J.

The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect

arXiv.org Artificial Intelligence May-22-2025

The nouns of our language refer to either concrete entities (like a table) or abstract concepts (like justice or love), and cognitive psychology has established that concreteness influences how words are processed. Accordingly, understanding how concreteness is represented in our mind and brain is a central question in psychology, neuroscience, and computational linguistics. While the advent of powerful language models has allowed for quantitative inquiries into the nature of semantic representations, it remains largely underexplored how they represent concreteness. Here, we used behavioral judgments to estimate semantic distances implicitly used by humans, for a set of carefully selected abstract and concrete nouns. Using Representational Similarity Analysis, we find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. We also find that both representational spaces are implicitly aligned to an explicit representation of concreteness, which was obtained from our participants using an additional concreteness rating task. Importantly, using ablation experiments, we demonstrate that the human-to-model alignment is substantially driven by concreteness, but not by other important word characteristics established in psycholinguistics. These results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.
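For readers unfamiliar with Representational Similarity Analysis, the sketch below shows one common way to score alignment between two representational spaces: rank-correlate the upper triangles of their pairwise dissimilarity matrices. The use of Spearman correlation, the function names, and the toy matrices are assumptions for illustration; the abstract does not state which correlation measure or data layout the authors used.

```python
def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square dissimilarity matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def rsa_alignment(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two dissimilarity matrices."""
    return pearson(average_ranks(upper_triangle(rdm_a)), average_ranks(upper_triangle(rdm_b)))


# Made-up dissimilarities among four nouns: one matrix per representational space.
human_rdm = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
model_rdm = [[0, 0.1, 0.5, 0.7], [0.1, 0, 0.4, 0.9], [0.5, 0.4, 0, 0.2], [0.7, 0.9, 0.2, 0]]
alignment = rsa_alignment(human_rdm, model_rdm)
```

The two toy matrices order the noun pairs identically, so the rank correlation is at its maximum even though the raw scales differ, which is the property that makes rank-based comparison a natural fit for behavioral judgments versus model distances.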

artificial intelligence, natural language, text processing, (19 more...)

2505.15682

Country:

North America > United States > Minnesota (0.28)
Europe > Germany (0.28)

Genre: Research Report > New Finding (0.95)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Gerrits, Kyo, Guerberof-Arenas, Ana

To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels

arXiv.org Artificial Intelligence Apr-29-2025

This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this effect is highest for HT and lowest for MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research. All the code and data are available at https://github.com/INCREC/Pilot_to_MT_or_not_to_MT

artificial intelligence, natural language, participant, (17 more...)

2504.1985

Country: Europe > United Kingdom > England (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (0.93)
Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)