Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia

Mei, Katelyn Xiaoying, Choi, Anna Seo Gyeong, Schellmann, Hilke, Sloane, Mona, Koenecke, Allison

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
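The Word Error Rate the abstract calls the "single gold-standard metric" is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of that computation (the function name and the whitespace tokenization are illustrative choices, not taken from the paper; real audits would first apply the text standardization the authors discuss):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference words),
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that because WER counts only word edits, it can understate failure modes such as hallucinated fluent text, which is precisely the gap the authors' third pitfall addresses.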


You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish

Cumbal, Ronald, Moell, Birger, Lopes, Jose, Engwall, Olof

arXiv.org Artificial Intelligence

The performance of state-of-the-art Automatic Speech Recognition (ASR) systems has steadily improved. However, performance tends to decrease considerably in more challenging conditions (e.g., background noise, multi-speaker social conversations) and with more atypical speakers (e.g., children, non-native speakers, or people with speech disorders), which means that general improvements do not necessarily transfer to applications that rely on ASR, e.g., educational software for younger students or language learners. In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. We compare the recognition results using Word Error Rate and analyze the linguistic factors that may generate the observed transcription errors.


Global Performance Disparities Between English-Language Accents in Automatic Speech Recognition

DiChristofano, Alex, Shuster, Henry, Chandra, Shefali, Patwari, Neal

arXiv.org Artificial Intelligence

However, many users are familiar with the frustrating experience of repeatedly not being understood by their voice assistant [16], so much so that frustration with ASR has become a culturally-shared source of comedy [4, 32]. Bias auditing of ASR services has quantified these experiences. English language ASR has higher error rates: for Black Americans compared to white Americans [24, 45], for stigmatised British accents compared to favored British accents [28], for Scottish speakers compared to speakers from California and New Zealand [44], for speakers whose first language is a tone language compared to those whose first language is not [2], for speakers with Indian accents compared to speakers with "American" accents [31], and for speakers whose first language is not English compared to those for whom it is [28]. It should go without saying, but everyone has an accent - there is no "unaccented" version of English [26]. Due to colonization and globalization, different Englishes are spoken around the world. While some English accents may be favored by those with class, race, and national origin privilege [28], there is no technical barrier to building an ASR system that works well on any particular accent. So we are left with the question: why does ASR performance vary as it does as a function of the global English accent spoken?


SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale

Tang, Raphael, Kumar, Karun, Yang, Gefei, Pandey, Akshat, Mao, Yajie, Belyaev, Vladislav, Emmadi, Madhuri, Murray, Craig, Ture, Ferhan, Lin, Jimmy

arXiv.org Artificial Intelligence

End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization since most companies lack vast human and computational resources. In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, we use a third-party ASR system as a weak supervision source, supplemented with labeling functions derived from implicit user feedback. To accelerate inference, we propose to route production-time queries across a pool of CUDA graphs of varying input lengths, the distribution of which best matches the traffic's. Compared to our third-party ASR, we achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Our system, called SpeechNet, currently serves 12 million queries per day on our voice-enabled smart television. To our knowledge, this is the first time a large-scale, Wav2vec-based deployment has been described in the academic literature.
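The routing idea above (pre-capturing CUDA graphs for a fixed set of input lengths, then dispatching each query to a compatible one) can be illustrated with a simple bucket lookup. The bucket sizes, frame units, and fallback behavior below are hypothetical assumptions for illustration; the paper states only that the pool's length distribution is chosen to match production traffic:

```python
import bisect

# Hypothetical pre-captured bucket lengths (in audio frames). A real
# deployment would pick these to match the observed traffic distribution.
BUCKETS = [400, 800, 1600, 3200]

def route_to_bucket(num_frames: int) -> int:
    """Return the smallest pre-captured bucket length that fits the input.

    The input would then be zero-padded up to the bucket length so the
    fixed-shape CUDA graph captured for that length can be replayed.
    """
    idx = bisect.bisect_left(BUCKETS, num_frames)
    if idx == len(BUCKETS):
        # Input exceeds every captured shape; assume the largest bucket
        # handles it (e.g., via chunking) in this sketch.
        return BUCKETS[-1]
    return BUCKETS[idx]
```

The trade-off is classic padding waste versus capture cost: more buckets mean less zero-padding per query but more graphs to capture and hold in memory.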


Why AI startups have different economics from classic SaaS startups

#artificialintelligence

Let's rewind the clock a bit. Back in the day, software vendors would write code, package it, and often distribute it physically (through those nifty things called CDs). In this old world, buyers shouldered most of the operational costs, such as running the applications they bought in their own local data and compute centers (or on laptops and desktops). Then came the advent of faster Internet speeds and cloud computing, which really opened up software development and deployment to a whole new world. With that, we started to see a dramatic shift of infrastructure costs back to the software vendor. That is, in the SaaS world, vendors host and manage web apps in their own data centers or cloud environments, allowing buyers to gradually decrease the investment and expenses associated with managing infrastructure.