AITopics | Stolcke, Andreas

Collaborating Authors

Stolcke, Andreas

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

Yu, Yu, Yang, Chao-Han Huck, Kolehmainen, Jari, Shivakumar, Prashanth G., Gu, Yile, Ryu, Sungho, Ren, Roger, Luo, Qi, Gourav, Aditya, Chen, I-Fan, Liu, Yi-Chieh, Dinh, Tuan, Gandhe, Ankur, Filimonov, Denis, Ghosh, Shalini, Stolcke, Andreas, Rastow, Ariya, Bulyko, Ivan

arXiv.org Artificial IntelligenceOct-10-2023

However, as the size of the pretrained models increases, the cost associated We propose a neural language modeling system based on with fine-tuning and deploying these models for low-rank adaptation (LoRA) for speech recognition output real-world applications also escalates. To address this practical rescoring. Although pretrained language models (LMs) challenge, a range of parameter-efficient methods (e.g., like BERT have shown superior performance in second-pass adapters, model reprogramming, and prompts) have been proposed rescoring, the high computational cost of scaling up the pretraining [11, 12, 13, 14, 15, 16, 17, 18] to alleviate the computation stage and adapting the pretrained models to specific and memory demands of fine-tuning LLMs. Low-rank domains limit their practical use in rescoring. Here we present adaptation (LoRA) [19] freezes all pretrained parameters in a method based on low-rank decomposition to train a rescoring the LLM and inserts a trainable pair of matrices (acting as a BERT model and adapt it to new domains using only a low-rank decomposition of a full matrix) additively into each fraction (0.08%) of the pretrained parameters. These inserted layer of the Transformer architecture. Compared to other matrices are optimized through a discriminative training objective parameter-efficient training methods, such as adapters [12], along with a correlation-based regularization loss. The LoRA has two distinct advantages: 1) it employs a simple proposed low-rank adaptation RescoreBERT (LoRB) architecture architecture and has the potential to reduce the number of is evaluated on LibriSpeech and internal datasets with trainable parameters compared to alternatives; 2) LoRA does decreased training times by factors between 5.4 and 3.6.

large language model, machine learning, proc, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ASRU57964.2023.10389632

2309.15223

Country:

North America > United States (0.14)
Europe (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

Yang, Chao-Han Huck, Gu, Yile, Liu, Yi-Chieh, Ghosh, Shalini, Bulyko, Ivan, Stolcke, Andreas

arXiv.org Artificial IntelligenceOct-10-2023

We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these task without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines causal instructions and demonstration to increase its context windows. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ASRU57964.2023.10389673

2309.15649

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Streaming Speech-to-Confusion Network Speech Recognition

Filimonov, Denis, Pandey, Prabhat, Rastrow, Ariya, Gandhe, Ankur, Stolcke, Andreas

arXiv.org Artificial IntelligenceJun-2-2023

In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20\% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.

artificial intelligence, hypothesis, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2306.03778

Country: North America > United States (0.14)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Add feedback

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Tankasala, Srinath, Chen, Long, Stolcke, Andreas, Raju, Anirudh, Deng, Qianli, Chandak, Chander, Khare, Aparna, Maas, Roland, Ravichandran, Venkatesh

arXiv.org Artificial IntelligenceMar-27-2023

We propose a novel approach for ASR N-best hypothesis rescoring with graph-based label propagation by leveraging cross-utterance acoustic similarity. In contrast to conventional neural language model (LM) based ASR rescoring/reranking models, our approach focuses on acoustic information and conducts the rescoring collaboratively among utterances, instead of individually. Experiments on the VCTK dataset demonstrate that our approach consistently improves ASR performance, as well as fairness across speaker groups with different accents. Our approach provides a low-cost solution for mitigating the majoritarian bias of ASR systems, without the need to train new domain- or accent-specific models.

machine learning, natural language, utterance, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10096820

2303.15132

Country: North America > United States > Texas (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
(2 more...)

Add feedback

Adaptive Endpointing with Deep Contextual Multi-armed Bandits

Min, Do June, Stolcke, Andreas, Raju, Anirudh, Vaz, Colin, He, Di, Ravichandran, Venkatesh, Trinh, Viet Anh

arXiv.org Artificial IntelligenceMar-23-2023

Current endpointing (EP) solutions learn in a supervised framework, which does not allow the model to incorporate feedback and improve in an online setting. Also, it is a common practice to utilize costly grid-search to find the best configuration for an endpointing model. In this paper, we aim to provide a solution for adaptive endpointing by proposing an efficient method for choosing an optimal endpointing configuration given utterance-level audio features in an online setting, while avoiding hyperparameter grid-search. Our method does not require ground truth labels, and only uses online learning from reward signals without requiring annotated labels. Specifically, we propose a deep contextual multi-armed bandit-based approach, which combines the representational power of neural networks with the action exploration behavior of Thompson modeling algorithms. We compare our approach to several baselines, and show that our deep bandit models also succeed in reducing early cutoff errors while maintaining low latency.

data mining, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10097142

2303.13407

Country: North America > United States (0.29)

Genre: Research Report (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.50)

Add feedback

Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech

Zhang, Xin, Vallés-Pérez, Iván, Stolcke, Andreas, Yu, Chengzhu, Droppo, Jasha, Shonibare, Olabanji, Barra-Chicote, Roberto, Ravichandran, Venkatesh

arXiv.org Artificial IntelligenceNov-4-2022

Stuttering is a speech disorder where the natural flow of speech is interrupted by blocks, repetitions or prolongations of syllables, words and phrases. The majority of existing automatic speech recognition (ASR) interfaces perform poorly on utterances with stutter, mainly due to lack of matched training data. Synthesis of speech with stutter thus presents an opportunity to improve ASR for this type of speech. We describe Stutter-TTS, an end-to-end neural text-to-speech model capable of synthesizing diverse types of stuttering utterances. We develop a simple, yet effective prosody-control strategy whereby additional tokens are introduced into source text during training to represent specific stuttering characteristics. By choosing the position of the stutter tokens, Stutter-TTS allows word-level control of where stuttering occurs in the synthesized utterance. We are able to synthesize stutter events with high accuracy (F1-scores between 0.63 and 0.84, depending on stutter type). By fine-tuning an ASR model on synthetic stuttered speech we are able to reduce word error by 5.7% relative on stuttered utterances, with only minor (< 0.2% relative) degradation for fluent utterances.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2211.09731

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Add feedback

Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Dheram, Pranav, Ramakrishnan, Murugesan, Raju, Anirudh, Chen, I-Fan, King, Brian, Powell, Katherine, Saboowala, Melissa, Shetty, Karan, Stolcke, Andreas

arXiv.org Artificial IntelligenceJul-22-2022

As for other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieve fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report on initial findings with both discovery and mitigation of performance disparities using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information to a more scalable method that groups speakers without human labels, using speaker embedding technology. For fairness mitigation, we find that oversampling of underrepresented cohorts, as well as modeling speaker cohort membership by additional input variables, reduces the gap between top- and bottom-performing cohorts, without deteriorating overall recognition accuracy.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2022-10816

2207.11345

Country: Europe (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Graph-based Multi-View Fusion and Local Adaptation: Mitigating Within-Household Confusability for Speaker Identification

Chen, Long, Meng, Yixiong, Ravichandran, Venkatesh, Stolcke, Andreas

arXiv.org Artificial IntelligenceJul-8-2022

Speaker identification (SID) in the household scenario (e.g., for smart speakers) is an important but challenging problem due to limited number of labeled (enrollment) utterances, confusable voices, and demographic imbalances. Conventional speaker recognition systems generalize from a large random sample of speakers, causing the recognition to underperform for households drawn from specific cohorts or otherwise exhibiting high confusability. In this work, we propose a graph-based semi-supervised learning approach to improve household-level SID accuracy and robustness with locally adapted graph normalization and multi-signal fusion with multi-view graphs. Unlike other work on household SID, fairness, and signal fusion, this work focuses on speaker label inference (scoring) and provides a simple solution to realize household-specific adaptation and multi-signal fusion without tuning the embeddings or training a fusion network. Experiments on the VoxCeleb dataset demonstrate that our approach consistently improves the performance across households with different customer cohorts and degrees of confusability.

artificial intelligence, household, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2022-11053

2207.04081

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.93)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.73)

Add feedback

CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals

Novotney, Scott, Mukherjee, Sreeparna, Ahmed, Zeeshan, Stolcke, Andreas

arXiv.org Artificial IntelligenceMar-16-2022

We propose a framework to modularize the training of neural language models that use diverse forms of sentence-external context (including metadata) by eliminating the need to jointly train sentence-external and within-sentence encoders. Our approach, contextual universal embeddings (CUE), trains LMs on one set of context, such as date and author, and adapts to novel metadata types, such as article title, or previous sentence. The model consists of a pretrained neural sentence LM, a BERT-based context encoder, and a masked transformer decoder that estimates LM probabilities using sentence-internal and sentence-external information. When context or metadata are unavailable, our model learns to combine contextual and sentence-internal information using noisy oracle unigram embeddings as a proxy. Real contextual information can be introduced later and used to adapt a small number of parameters that map contextual data into the decoder's embedding space. We validate the CUE framework on a NYTimes text corpus with multiple metadata types, for which the LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context. Bootstrapping a contextual LM with only a subset of the context/metadata during training retains 85\% of the achievable gain. Training the model initially with proxy context retains 67% of the perplexity gain after adapting to real context. Furthermore, we can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model. Overall, we obtain a modular framework that allows incremental, scalable training of context-enhanced LMs.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2022.findings-acl.265

2203.08774

Country:

Europe (1.00)
North America > United States (0.68)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition

Yang, Chao-Han Huck, Ahmed, Zeeshan, Gu, Yile, Szurley, Joseph, Ren, Roger, Liu, Linda, Stolcke, Andreas, Bulyko, Ivan

arXiv.org Artificial IntelligenceFeb-17-2022

In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without accessing gradient information of a targeted ASR model directly. We propose an advanced Bayesian neural network (BNN) based adversarial detector, which could model latent distributions against adaptive adversarial perturbation with divergence measurement. We further simulate deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR systems with the proposed adversarial detection system. Leveraging the proposed BNN based detection system, we improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets compared to the current model enhancement methods against the adversarial speech examples.

artificial intelligence, enhanced end-to-end speech recognition, mitigating closed-model adversarial example, (1 more...)

arXiv.org Artificial Intelligence

2202.08532

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)

Add feedback