Goto

Collaborating Authors

 aphasia


The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model--consensus agreement vs. 0.286 for human--human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.


Practical Machine Learning for Aphasic Discourse Analysis

arXiv.org Artificial Intelligence

Analyzing spoken discourse is a valid means of quantifying language ability in persons with aphasia. There are many ways to quantify discourse, one common way being to evaluate the informativeness of the discourse. That is, given the total number of words produced, how many of those are context-relevant and accurate. This type of analysis is called Correct Information Unit (CIU) analysis and is one of the most prevalent discourse analyses used by speech-language pathologists (SLPs). Despite this, CIU analysis in the clinic remains limited due to the manual labor needed by SLPs to code and analyze collected speech. Recent advances in machine learning (ML) seek to augment such labor by automating modeling of propositional, macrostructural, pragmatic, and multimodal dimensions of discourse. To that end, this study evaluated five ML models for reliable identification of Correct Information Units (CIUs, Nicholas & Brookshire, 1993), during a picture description task. The five supervised ML models were trained using randomly selected human-coded transcripts and accompanying words and CIUs from persons with aphasia. The baseline model training produced a high accuracy across transcripts for word vs non-word, with all models achieving near perfect performance (0.995) with high AUC range (0.914 min, 0.995 max). In contrast, CIU vs non-CIU showed a greater variability, with the k-nearest neighbor (k-NN) model the highest accuracy (0.824) and second highest AUC (0.787). These findings indicate that while the supervised ML models can distinguish word from not word, identifying CIUs is challenging.


Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers

arXiv.org Artificial Intelligence

Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interd isciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLM) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real - time language processing in the brain; and natural language processing (NLP), to evaluate their interpretation against linguistic errors. A purpose - built corpus of authentic errors of native Spanish (+500) will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real hum an language. In recent years, the development of large language models (LLMs) such as GPT - 4 and Gemini has brought a revolution in the field of natural language processing (NLP). These models, based on transformer architectures ( Vaswani et al., 2017), have demonstrated unprecedented abilities to generate coherent text, perform automatic translation, and produce complex summaries. Their impressive performance has transformed many applications, from chatbots and virtual assistants to automated content creation and languag e learning tools. However, despite these technological advances, LLMs still face significant challenges rooted in the inherently complex, ambiguous, and variable nature of real human language ( Bender et al., 2021). Particularly, irregularities, ambiguities, and errors commonly found in informal and spontaneous contexts, such as everyday conversations or social media interactions, constitute a major obstacle for the optimal functioning of these systems. Human language is not a rigid or perfectly normative system; rather, it is a dynamic phenomenon that reflects complex cognitive processes and is characterized by variations and errors in production and comprehension ( Levelt, 1989). Linguistic errors produced even by native speakers should not be dismissed as mere random deviations but regarded as systematic manifestations that can provide valuable insight into the internal functioning of the linguistic system and its neurological foundations ( Fromkin, 2013) .


Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts

arXiv.org Artificial Intelligence

Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts Jason M. Pittman1, Anton Phillips Jr.2, Yesenia Medina-Santos2, Brielle C. Stark2 1University of Maryland Global Campus 2Indiana University Bloomington, Department of Speech, Language and Hearing Sciences ABSTRACT In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when such are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts. Keywords: aphasia, synthetic data, natural language processing, machine learning Introduction Per Nicholas and Brookshire (1993), coding Correct Information Units (CIUs) involves transcribing a connected speech sample verbatim, counting all intelligible words, and then identifying each word that is intelligible, accurate, relevant, and informative about the topic as a CIU--excluding fillers, repetitions, and tangential remarks. From these counts, clinicians calculate the percentage of CIUs and CIUs per minute to quantify communicative informativeness and efficiency.


AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition

arXiv.org Artificial Intelligence

This paper proposes AS-ASR, a lightweight aphasia-specific speech recognition framework based on Whisper-tiny, tailored for low-resource deployment on edge devices. Our approach introduces a hybrid training strategy that systematically combines standard and aphasic speech at varying ratios, enabling robust generalization, and a GPT-4-based reference enhancement method that refines noisy aphasic transcripts, improving supervision quality. We conduct extensive experiments across multiple data mixing configurations and evaluation settings. Results show that our fine-tuned model significantly outperforms the zero-shot baseline, reducing WER on aphasic speech by over 30% while preserving performance on standard speech. The proposed framework offers a scalable, efficient solution for real-world disordered speech recognition.


Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.


Learning from Impairment: Leveraging Insights from Clinical Linguistics in Language Modelling Research

arXiv.org Artificial Intelligence

This position paper investigates the potential of integrating insights from language impairment research and its clinical treatment to develop human-inspired learning strategies and evaluation frameworks for language models (LMs). We inspect the theoretical underpinnings underlying some influential linguistically motivated training approaches derived from neurolinguistics and, particularly, aphasiology, aimed at enhancing the recovery and generalization of linguistic skills in aphasia treatment, with a primary focus on those targeting the syntactic domain. We highlight how these insights can inform the design of rigorous assessments for LMs, specifically in their handling of complex syntactic phenomena, as well as their implications for developing human-like learning strategies, aligning with efforts to create more sustainable and cognitively plausible natural language processing (NLP) models.


HALO: Hallucination Analysis and Learning Optimization to Empower LLMs with Retrieval-Augmented Context for Guided Clinical Decision Making

arXiv.org Artificial Intelligence

Large language models (LLMs) have significantly advanced natural language processing tasks, yet they are susceptible to generating inaccurate or unreliable responses, a phenomenon known as hallucination. In critical domains such as health and medicine, these hallucinations can pose serious risks. This paper introduces HALO, a novel framework designed to enhance the accuracy and reliability of medical question-answering (QA) systems by focusing on the detection and mitigation of hallucinations. Our approach generates multiple variations of a given query using LLMs and retrieves relevant information from external open knowledge bases to enrich the context. We utilize maximum marginal relevance scoring to prioritize the retrieved context, which is then provided to LLMs for answer generation, thereby reducing the risk of hallucinations. The integration of LangChain further streamlines this process, resulting in a notable and robust increase in the accuracy of both open-source and commercial LLMs, such as Llama-3.1 (from 44% to 65%) and ChatGPT (from 56% to 70%). This framework underscores the critical importance of addressing hallucinations in medical QA systems, ultimately improving clinical decision-making and patient care. The open-source HALO is available at: https://github.com/ResponsibleAILab/HALO.


A Lesion-aware Edge-based Graph Neural Network for Predicting Language Ability in Patients with Post-stroke Aphasia

arXiv.org Artificial Intelligence

We propose a lesion-aware graph neural network (LEGNet) to predict language ability from resting-state fMRI (rs-fMRI) connectivity in patients with post-stroke aphasia. Our model integrates three components: an edge-based learning module that encodes functional connectivity between brain regions, a lesion encoding module, and a subgraph learning module that leverages functional similarities for prediction. We use synthetic data derived from the Human Connectome Project (HCP) for hyperparameter tuning and model pretraining. We then evaluate the performance using repeated 10-fold cross-validation on an in-house neuroimaging dataset of post-stroke aphasia. Our results demonstrate that LEGNet outperforms baseline deep learning methods in predicting language ability. LEGNet also exhibits superior generalization ability when tested on a second in-house dataset that was acquired under a slightly different neuroimaging protocol. Taken together, the results of this study highlight the potential of LEGNet in effectively learning the relationships between rs-fMRI connectivity and language ability in a patient cohort with brain lesions for improved post-stroke aphasia evaluation.


Careless Whisper: Speech-to-Text Hallucination Harms

arXiv.org Artificial Intelligence

Use of such speech-to-text APIs is increasingly prevalent in high-stakes downstream applications, ranging from surveillance of incarcerated people [22] to medical care [14]. While such speech-to-text APIs can generate written transcriptions more quickly than human transcribers, there are grave concerns regarding bias in automated transcription accuracy, e.g., underperformance for African American English speakers [11] and speakers with speech impairments such as dysphonia [12]. These biases within APIs can perpetuate disparities when real-world decisions are made based on automated speech-to-text transcriptions--from police making carceral judgements to doctors making treatment decisions. OpenAI released its Whisper speech-to-text API in September 2022 with experiments showing better speech transcription accuracy relative to market competitors [19]. We evaluate Whisper's transcription performance on the axis of "hallucinations," defined as undesirable generated text "that is nonsensical, or unfaithful to the provided source input" [10]. Our approach compares the ground truth of a speech snippet with the outputted transcription; we find hallucinations in roughly 1% of transcriptions generated in mid-2023, wherein Whisper hallucinates entire made-up sentences when no one is speaking in the input audio files. While hallucinations have been increasingly studied in the context of text generated by ChatGPT (a language model also made by OpenAI) [8, 10], hallucinations have only been considered in speech-to-text models as a means to study error prediction [21], and not as a fundamental concern in and of itself. In this paper, we provide experimental quantification of Whisper hallucinations, finding that nearly 40% of the hallucinations are harmful or concerning in some way (as opposed to innocuous and random).