Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts
Pittman, Jason M., Phillips, Anton Jr., Medina-Santos, Yesenia, Stark, Brielle C.
arXiv.org Artificial Intelligence
Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts

Jason M. Pittman (1), Anton Phillips Jr. (2), Yesenia Medina-Santos (2), Brielle C. Stark (2)
(1) University of Maryland Global Campus
(2) Indiana University Bloomington, Department of Speech, Language and Hearing Sciences

ABSTRACT

In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity: only about 600 transcripts are available in AphasiaBank, yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. This study therefore constructs and validates two methods for generating synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method takes a procedural programming approach; the second uses the Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. Both methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found that, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of the linguistic degradation observed in aphasia, showing the most realistic directional changes in number of different words (NDW), word count, and word length among the synthetic generation methods. Based on these results, future work should create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.
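The abstract names three procedural degradation operations (word dropping, filler insertion, paraphasia substitution) applied at four severity levels, but does not give the parameters. As one hedged illustration, a minimal sketch of such a procedural generator — with hypothetical per-severity probabilities, a toy filler inventory, and toy paraphasia substitutions, none of which are the paper's actual values — might look like:

```python
import random

# Hypothetical per-severity probabilities; the paper's actual parameters are not given here.
SEVERITY_PARAMS = {
    "mild":        {"drop": 0.05, "filler": 0.05, "paraphasia": 0.02},
    "moderate":    {"drop": 0.15, "filler": 0.10, "paraphasia": 0.05},
    "severe":      {"drop": 0.30, "filler": 0.20, "paraphasia": 0.10},
    "very_severe": {"drop": 0.50, "filler": 0.30, "paraphasia": 0.15},
}

FILLERS = ["um", "uh", "er"]  # illustrative filler inventory
PARAPHASIAS = {"cat": "hat", "tree": "free", "ladder": "latter"}  # toy substitutions

def degrade(transcript: str, severity: str, seed: int = 0) -> str:
    """Apply word dropping, paraphasia substitution, and filler insertion
    to a fluent transcript, at rates set by the severity level."""
    rng = random.Random(seed)  # seeded for reproducible output
    p = SEVERITY_PARAMS[severity]
    out = []
    for word in transcript.split():
        if rng.random() < p["drop"]:
            continue  # word dropping: omit this word entirely
        if word.lower() in PARAPHASIAS and rng.random() < p["paraphasia"]:
            word = PARAPHASIAS[word.lower()]  # paraphasia substitution
        if rng.random() < p["filler"]:
            out.append(rng.choice(FILLERS))  # filler insertion before the word
        out.append(word)
    return " ".join(out)
```

With a fixed seed the output is deterministic, which matters for reproducing a synthetic dataset; higher severity levels drop more content words and insert more fillers, mirroring the directional changes in word count and NDW the abstract describes.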
Keywords: aphasia, synthetic data, natural language processing, machine learning

Introduction

Per Nicholas and Brookshire (1993), coding Correct Information Units (CIUs) involves transcribing a connected speech sample verbatim, counting all intelligible words, and then identifying as a CIU each word that is intelligible, accurate, relevant, and informative about the topic--excluding fillers, repetitions, and tangential remarks. From these counts, clinicians calculate the percentage of CIUs and CIUs per minute to quantify communicative informativeness and efficiency.
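The two derived measures described above reduce to simple ratios over the counts a clinician records. A minimal sketch of that arithmetic (the function name and signature are illustrative, not from the paper):

```python
def ciu_metrics(total_words: int, cius: int, duration_min: float) -> tuple[float, float]:
    """Compute the two CIU-derived measures from Nicholas and Brookshire (1993):
    percent CIUs (informativeness) and CIUs per minute (efficiency).

    total_words  -- count of all intelligible words in the sample
    cius         -- count of words coded as Correct Information Units
    duration_min -- sample duration in minutes
    """
    percent_cius = 100.0 * cius / total_words  # informativeness
    cius_per_min = cius / duration_min         # efficiency
    return percent_cius, cius_per_min
```

For example, a 2-minute sample with 200 intelligible words of which 120 are coded as CIUs yields 60% CIUs and 60 CIUs per minute.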
Oct-31-2025