 speech impairment


A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages

arXiv.org Artificial Intelligence

This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly in low-resource languages. It aims to democratize ASR technology and data collection by developing a "cookbook" of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan, a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, is publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech-impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.


Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech

arXiv.org Artificial Intelligence

Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition (ASR) systems. Despite recent advances, ASR models like Whisper struggle with non-normative speech due to limited training data and the difficulty of collecting and annotating non-normative speech samples. In this work, we propose a practical and lightweight pipeline to personalize ASR models, formalizing the selection of words and enriching a small, speech-impaired dataset with semantic coherence. Applied to data from a child with a structural speech impairment, our approach shows promising improvements in transcription quality, demonstrating the potential to reduce communication barriers for individuals with atypical speech patterns.


Inclusivity of AI Speech in Healthcare: A Decade Look Back

arXiv.org Artificial Intelligence

The integration of AI speech recognition technologies into healthcare has the potential to revolutionize clinical workflows and patient-provider communication. However, this study reveals significant gaps in inclusivity, with datasets and research disproportionately favouring high-resource languages, standardized accents, and narrow demographic groups. These biases risk perpetuating healthcare disparities, as AI systems may misinterpret speech from marginalized groups. This paper highlights the urgent need for inclusive dataset design, bias mitigation research, and policy frameworks to ensure equitable access to AI speech technologies in healthcare.


Careless Whisper: Speech-to-Text Hallucination Harms

arXiv.org Artificial Intelligence

Use of speech-to-text APIs is increasingly prevalent in high-stakes downstream applications, ranging from surveillance of incarcerated people [22] to medical care [14]. While such speech-to-text APIs can generate written transcriptions more quickly than human transcribers, there are grave concerns regarding bias in automated transcription accuracy, e.g., underperformance for African American English speakers [11] and speakers with speech impairments such as dysphonia [12]. These biases within APIs can perpetuate disparities when real-world decisions are made based on automated speech-to-text transcriptions, from police making carceral judgements to doctors making treatment decisions. OpenAI released its Whisper speech-to-text API in September 2022 with experiments showing better speech transcription accuracy relative to market competitors [19]. We evaluate Whisper's transcription performance on the axis of "hallucinations," defined as undesirable generated text "that is nonsensical, or unfaithful to the provided source input" [10]. Our approach compares the ground truth of a speech snippet with the output transcription; we find hallucinations in roughly 1% of transcriptions generated in mid-2023, wherein Whisper hallucinates entire made-up sentences when no one is speaking in the input audio files. While hallucinations have been increasingly studied in the context of text generated by ChatGPT (a language model also made by OpenAI) [8, 10], hallucinations have only been considered in speech-to-text models as a means to study error prediction [21], and not as a fundamental concern in and of itself. In this paper, we provide experimental quantification of Whisper hallucinations, finding that nearly 40% of the hallucinations are harmful or concerning in some way (as opposed to innocuous and random).


An analysis of degenerating speech due to progressive dysarthria on ASR performance

arXiv.org Artificial Intelligence

Although personalized automatic speech recognition (ASR) models have recently been designed to recognize even severely impaired speech, model performance may degrade over time for persons with degenerating speech. The aims of this study were to (1) analyze the change of performance of ASR over time in individuals with degrading speech, and (2) explore mitigation strategies to optimize recognition throughout disease progression. Speech was recorded by four individuals with degrading speech due to amyotrophic lateral sclerosis (ALS). Word error rates (WER) across recording sessions were computed for three ASR models: Unadapted Speaker Independent (U-SI), Adapted Speaker Independent (A-SI), and Adapted Speaker Dependent (A-SD or personalized). The performance of all three models degraded significantly over time as speech became more impaired, but the performance of the A-SD model improved markedly when it was updated with recordings from the severe stages of speech progression. Recording additional utterances early in the disease before speech degraded significantly did not improve the performance of A-SD models. Overall, our findings emphasize the importance of continuous recording (and model retraining) when providing personalized models for individuals with progressive speech impairments.
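
As a rough illustration of the metric tracked in this study, WER can be computed as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal, assumed implementation and not the authors' evaluation code; the function name `word_error_rate` and the example utterances are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterance pairs from an early and a late recording session.
early = word_error_rate("turn on the light", "turn on the light")
late = word_error_rate("turn on the light", "turn of light")
print(early, late)
```

Computing this value per recording session, as the study does for each of its three models, gives a simple way to see whether recognition quality drifts as speech changes.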


Assessing ASR Model Quality on Disordered Speech using BERTScore

arXiv.org Artificial Intelligence

Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality. It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers. It is hard to determine if models can be useful at such high error rates. This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness. Both BERTScore and WER were compared to prediction errors manually annotated by Speech Language Pathologists for error type and assessment. BERTScore was found to be more correlated with human assessment of error type and assessment. BERTScore was specifically more robust to orthographic changes (contraction and normalization errors) where meaning was preserved. Furthermore, BERTScore was a better fit of error assessment than WER, as measured using an ordinal logistic regression and the Akaike Information Criterion (AIC). Overall, our findings suggest that BERTScore can complement WER when assessing ASR model performance from a practical perspective, especially for accessibility applications where models are useful even at lower accuracy than for typical speech.
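
For readers unfamiliar with the model comparison mentioned here, AIC is defined as 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximized likelihood; the model with the lower AIC is preferred. The sketch below only illustrates that formula. The log-likelihoods and parameter counts are made-up placeholders, not values from the paper, and the `aic` helper is hypothetical.

```python
def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical fits of two ordinal regressions predicting the human
# error assessment, one from BERTScore and one from WER.
# These numbers are illustrative assumptions only.
aic_bertscore = aic(log_likelihood=-120.0, num_params=4)
aic_wer = aic(log_likelihood=-131.5, num_params=4)

better = "BERTScore" if aic_bertscore < aic_wer else "WER"
print(aic_bertscore, aic_wer, better)
```

With equal parameter counts, as assumed here, the comparison reduces to which predictor yields the higher likelihood for the annotated assessments.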


Podcast: How AI is giving a woman back her voice

MIT Technology Review

Voice technology is one of the biggest trends in the healthcare space. We look at how it might help care providers and patients, from a woman who is losing her speech, to documenting healthcare records for doctors. But how do you teach AI to learn to communicate more like a human, and will it lead to more efficient machines? This episode was reported and produced by Anthony Green with help from Jennifer Strong and Emma Cillekens. It was edited by Michael Reilly. Our mix engineer is Garret Lang and our theme music is by Jacob Gorski. Jennifer: Healthcare looks a little different than it did not so long ago... when your doctor likely wrote down details about your condition on a piece of paper...


Watch: Google unveils new AI app to help people with speech impairments

#artificialintelligence

Google is seeking volunteers for a new beta app called Project Relate, which aims to provide people with speech impairments with a voice assistant that can transcribe their speech in real time as well as synthesize what they are saying. The app is part of Project Euphonia, a wider endeavor started in 2019 that is aimed at collecting data to improve how Google's AI algorithms handle speech from people who "have difficulty being understood by others," such as those affected by neurological conditions. As for the Relate app, it has three key features. The Listen feature will transcribe a user's speech in real time, allowing them to copy and paste into other apps or show to other people. The Repeat feature will restate what the user is saying in a "clear synthesized voice," which Google hopes will aid face-to-face conversations and help when people with speech impairments want to speak a command to a smart home device.


Google made an app to ease communication for people with speech impairments

Engadget

For too long, people with speech impairments have struggled to be understood not only by other people, but also by voice-based technology. Though some companies have started to make their products work better for people with atypical speech, the most prevalent services still don't hear them well. Google announced today that it's made a new Android app called Project Relate that could help people with speech impairments communicate more easily with others and the Assistant. It's looking for beta testers to test and improve the app starting today. As Julie Cattiau, product manager for Google Research, said in a video, "standard speech recognition doesn't always work as well for people with atypical speech because the algorithms have not been trained on samples of their speech."


Text to Speech Technology: How Voice Computing is Building a More Accessible World

#artificialintelligence

In a world where new technology emerges at exponential rates, and our daily lives are increasingly mediated by speakers and sound waves, text to speech technology is the latest force evolving the way we communicate. Text to speech technology refers to a field of computer science that enables the conversion of language text into audible speech. Also known as voice computing, text to speech (TTS) often involves building a database of recorded human speech to train a computer to produce sound waves that resemble the natural sound of a human speaking. This process is called speech synthesis. The technology is trailblazing and major breakthroughs in the field occur regularly.