Language Use
Characterizing Language Use in a Collaborative Situated Game

Tomlin, Nicholas, Zhou, Naitian, Fleisig, Eve, Chen, Liangyuan, Wright, Téa, Vinh, Lauren, Ma, Laura X., Eisape, Seun, French, Ellie, Du, Tingting, Zhang, Tianjiao, Koller, Alexander, Suhr, Alane

arXiv.org Artificial Intelligence

Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.


NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use

Zhang, Yuqing, Ürker, Ecesu, Verhoef, Tessa, Boleda, Gemma, Bisazza, Arianna

arXiv.org Artificial Intelligence

Lexical semantic change has primarily been investigated with observational and experimental methods; however, observational methods (corpus analysis, distributional semantic modeling) cannot get at causal mechanisms, and experimental paradigms with humans are hard to apply to semantic change due to the extended diachronic processes involved. This work introduces NeLLCom-Lex, a neural-agent framework designed to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. Using a well-established color naming task, we simulate the evolution of a lexical system within a single generation, and study which factors lead agents to: (i) develop human-like naming behavior and lexicons, and (ii) change their behavior and lexicons according to their communicative needs. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to 'speak' an existing language can reproduce human-like patterns in color naming to a remarkable extent, supporting the further use of NeLLCom-Lex to elucidate the mechanisms of semantic change.


Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults

van Dijk, Bram, Kuiper, Tiberon, Ahmed, Sirin Aoulad si, Levebvre, Armel, Johnson, Jake, Duin, Jan, Mooijaart, Simon, Spruit, Marco

arXiv.org Artificial Intelligence

Voice-controlled interfaces can support older adults in clinical contexts -- with chatbots being a prime example -- but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to real-world datasets. Moreover, our results indicate that truncating generic models is helpful in balancing the accuracy-speed trade-off. Nonetheless, we also find inputs which cause a high word error rate and place them in context.
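The benchmark above is built around word error rate (WER), the standard ASR accuracy metric. As a minimal illustration (not the authors' evaluation code), WER is the word-level edit distance between reference and hypothesis, normalized by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming Levenshtein distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER above 1.0 is possible when the hypothesis contains many insertions, which is one way "inputs which cause a high word error rate" show up in practice.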


An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment

Lo, Tien-Hong, Chen, Szu-Yu, Sung, Yao-Ting, Chen, Berlin

arXiv.org Artificial Intelligence

A recent line of research on automated speaking assessment (ASA) has benefited from self-supervised learning (SSL) representations, which capture rich acoustic and linguistic patterns in non-native speech without prior assumptions about feature curation. However, speech-based SSL models capture acoustic-related traits but overlook linguistic content, while text-based SSL models rely on ASR output and fail to encode prosodic nuances. Moreover, most prior work treats proficiency levels as nominal classes, ignoring their ordinal structure and the non-uniform intervals between proficiency labels. To address these limitations, we propose an effective ASA approach combining SSL with handcrafted indicator features via a novel modeling paradigm. We further introduce a multi-margin ordinal loss that jointly models both the score ordinality and the non-uniform intervals of proficiency labels. Extensive experiments on the TEEMI corpus show that our method consistently outperforms strong baselines and generalizes well to unseen prompts.
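The abstract does not give the exact form of the proposed multi-margin ordinal loss, but the idea of modeling ordinality with per-boundary (non-uniform) margins can be sketched with a threshold-based hinge loss: a scalar score must clear each cut point below the true label by that cut's margin, and stay below each cut at or above it by the same amount. The threshold and margin values here are illustrative:

```python
def multi_margin_ordinal_loss(score, label, thresholds, margins):
    """Hinge-style ordinal loss with a distinct margin per threshold.

    score: scalar model output; label: int in {0, ..., K-1};
    thresholds: K-1 ordered cut points between adjacent levels;
    margins: K-1 per-cut margins (non-uniform margins encode the
    unequal gaps between proficiency labels).
    """
    loss = 0.0
    for k, (t, m) in enumerate(zip(thresholds, margins)):
        if k < label:
            # Score should sit at least m above cut t.
            loss += max(0.0, m - (score - t))
        else:
            # Score should sit at least m below cut t.
            loss += max(0.0, m - (t - score))
    return loss
```

Because every cut point contributes a term, predictions far from the true level on the ordinal scale accrue more penalty than near misses, which a nominal cross-entropy loss cannot express.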


Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

Song, Dan, Lee, Won-Chan, Jiao, Hong

arXiv.org Artificial Intelligence

Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, which suggests that hybrid scoring models may offer benefits for large-scale writing assessments.

Keywords: large language model; automated essay scoring; generalizability theory; writing assessment; AI-human comparison

The integration of large language models (LLMs) into automated essay scoring (AES) represents a significant shift in how essay scoring is approached. While traditional AES systems have long depended on manually engineered features and statistical models (Attali & Burstein, 2006; Dikli, 2006), LLMs offer the potential to assess student writing with greater flexibility and contextual sensitivity by drawing on deep learning architectures trained on diverse textual corpora (Ifenthaler, 2022; Ouyang et al., 2022). However, despite their promising capabilities, recent studies indicate that LLMs have not yet consistently matched the scoring reliability of established AES tools or trained human raters, especially in high-stakes language assessment contexts (Mizumoto & Eguchi, 2023; Xiao et al., 2025; Yancey et al., 2023).

These concerns highlight the need for rigorous evaluation of LLM-based scoring systems, particularly with respect to their reliability and alignment with human scoring standards. This study addresses these challenges by applying generalizability theory to systematically examine the consistency of LLM-generated scores on standardized writing tasks in the AP Chinese Language and Culture Exam (AP Chinese Exam).

Literature Review

This section reviews the literature on AES and the application of LLMs to AES. It also provides brief overviews of generalizability theory and the AP Chinese Language and Culture Exam, followed by the research questions addressed in this study.
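Generalizability theory decomposes observed score variance into components (persons, raters, and their interaction) and expresses reliability as the share of variance attributable to true person differences. A minimal sketch for the simplest fully crossed persons-by-raters design (not the study's actual, more elaborate design) is:

```python
import numpy as np

def g_coefficient(scores, n_raters_decision=None):
    """Generalizability coefficient for a fully crossed persons x raters
    design with random raters. scores: 2-D array [persons, raters].
    n_raters_decision: number of raters assumed in the decision study
    (defaults to the number observed)."""
    X = np.asarray(scores, dtype=float)
    n_p, n_r = X.shape
    grand = X.mean()
    # Mean squares from a two-way ANOVA without replication.
    ms_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
    resid = (X - X.mean(axis=1, keepdims=True)
               - X.mean(axis=0, keepdims=True) + grand)
    ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))
    # Variance components: persons, and person-by-rater interaction/error.
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_pr = ms_pr
    n_prime = n_raters_decision or n_r
    return var_p / (var_p + var_pr / n_prime)
```

Averaging over more raters in the decision study shrinks the error term `var_pr / n_prime`, which is why the composite human-plus-AI scoring described above can improve reliability.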


Advancing Automated Speaking Assessment Leveraging Multifaceted Relevance and Grammar Information

Lu, Hao-Chien, Lin, Jhen-Ke, Lin, Hong-Yun, Wang, Chung-Chun, Chen, Berlin

arXiv.org Artificial Intelligence

Current automated speaking assessment (ASA) systems for multi-aspect evaluation often fail to make full use of content relevance, overlooking image or exemplar cues, and employ superficial grammar analysis that lacks detailed error types. This paper addresses these deficiencies by introducing two novel enhancements to construct a hybrid scoring model. First, a multifaceted relevance module integrates the question, the associated image content, the exemplar, and the L2 speaker's spoken response for a comprehensive assessment of content relevance. Second, fine-grained grammar error features are derived using advanced grammar error correction (GEC) and detailed annotation to identify specific error categories. Experiments and ablation studies demonstrate that these components significantly improve the evaluation of content relevance, language use, and overall ASA performance, highlighting the benefits of richer, more nuanced feature sets for holistic speaking assessment.


LLMs syntactically adapt their language use to their conversational partner

Kandra, Florian, Demberg, Vera, Koller, Alexander

arXiv.org Artificial Intelligence

It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.
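The abstract does not specify how syntactic similarity between the two LLM agents is measured. One simple proxy (purely illustrative, not the paper's metric) is to build a frequency profile of syntactic features, such as bigrams over part-of-speech tags, for each speaker and compare the profiles with cosine similarity; adaptation would show up as this similarity rising across conversation windows:

```python
from collections import Counter
import math

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def syntactic_similarity(turns_a, turns_b):
    """Similarity of two speakers' syntactic profiles, proxied here by
    bigrams over pre-tagged POS sequences (one list of tags per turn)."""
    bigrams = lambda tags: zip(tags, tags[1:])
    profile_a = Counter(b for t in turns_a for b in bigrams(t))
    profile_b = Counter(b for t in turns_b for b in bigrams(t))
    return cosine(profile_a, profile_b)
```

Computing this separately for the early and late portions of each conversation, and checking whether the late-window value is reliably higher, is one rudimentary way to operationalize the convergence the paper reports.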


Whose story is it? Personalizing story generation by inferring author styles

Kumar, Nischal Ashok, Pham, Chau Minh, Iyyer, Mohit, Lan, Andrew

arXiv.org Artificial Intelligence

Personalization has become essential for improving user experience in interactive writing and educational applications, yet its potential in story generation remains largely unexplored. In this work, we propose a novel two-stage pipeline for personalized story generation. Our approach first infers an author's implicit story-writing characteristics from their past work and organizes them into an Author Writing Sheet, inspired by narrative theory. The second stage uses this sheet to simulate the author's persona through tailored persona descriptions and personalized story writing rules. To enable and validate our approach, we construct Mythos, a dataset of 590 stories from 64 authors across five distinct sources that reflect diverse story-writing settings. A head-to-head comparison with a non-personalized baseline demonstrates our pipeline's effectiveness in generating high-quality personalized stories. Our personalized stories achieve a 75 percent win rate (versus 14 percent for the baseline and 11 percent ties) in capturing authors' writing style based on their past works. Human evaluation highlights the high quality of our Author Writing Sheet and provides valuable insights into the personalized story generation task. Notable takeaways are that writings from certain sources, such as Reddit, are easier to personalize than others, like AO3, while narrative aspects, like Creativity and Language Use, are easier to personalize than others, like Plot.


How desirable is alignment between LLMs and linguistically diverse human users?

Knoeferle, Pia, Möller, Sebastian, Kolossa, Dorothea, Solopova, Veronika, Rehm, Georg

arXiv.org Artificial Intelligence

We discuss how desirable it is that Large Language Models (LLMs) be able to adapt or align their language behavior with users who may be diverse in their language use. User diversity may arise from, among other factors: i) age differences; ii) gender characteristics; and/or iii) multilingual experience, and the associated differences in language processing and use. We consider potential consequences for usability, communication, and LLM development.


Syntactic Evolution in Language Usage

Kumar, Surbhit

arXiv.org Artificial Intelligence

This research investigates the dynamic nature of linguistic style across stages of life, from late adolescence to old age. Employing linguistic analysis tools and methodologies, the study examines how individuals adapt and modify their language use over time. It draws on a 2004 dataset of blogs from blogger.com and focuses on English for syntactic analysis. The findings have implications for linguistics, psychology, and communication studies, shedding light on the intricate relationship between age and language.