Gandhe, Ankur
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Lin, Guan-Ting, Shivakumar, Prashanth Gurunath, Gourav, Aditya, Gu, Yile, Gandhe, Ankur, Lee, Hung-yi, Bulyko, Ivan
While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using the ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization for improving the semantics of SLMs.
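As a rough sketch of the preference-optimization step described above (the function names and the semantic_score callable are illustrative placeholders, not the paper's implementation):

import torch.nn.functional as F

def make_preference_pair(continuations, semantic_score):
    # Rank the sampled speech continuations by a semantic metric and
    # keep the extremes as the (chosen, rejected) pair for DPO.
    ranked = sorted(continuations, key=semantic_score, reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO objective over sequence log-probability tensors under
    # the policy (pi_*) and a frozen reference model (ref_*).
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()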
Speech Recognition Rescoring with Large Speech-Text Foundation Models
Shivakumar, Prashanth Gurunath, Kolehmainen, Jari, Gourav, Aditya, Gu, Yi, Gandhe, Ankur, Rastrow, Ariya, Bulyko, Ivan
Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data. Automatic speech recognition (ASR) systems are often limited by the available transcribed speech data and benefit from second-pass rescoring with an LLM. Recently, multi-modal large language models, particularly speech-text foundation models, have demonstrated strong spoken language understanding. Speech-text foundation models leverage large amounts of unlabelled and labelled data in both speech and text modalities to model human language. In this work, we propose novel techniques for using multi-modal LLMs for ASR rescoring. We also explore discriminative training to further improve the foundation model's rescoring performance, and we demonstrate that cross-modal knowledge transfer in speech-text LLMs can benefit rescoring. Our experiments show up to 20% relative improvement over Whisper large ASR and up to 15% relative improvement over a text-only LLM.
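A minimal sketch of second-pass n-best rescoring as described above; llm_score stands in for scoring a hypothesis with the speech-text LLM, and the interpolation weight is illustrative:

def rescore_nbest(nbest, llm_score, weight=0.5):
    # Interpolate the first-pass ASR score with an LLM log-likelihood
    # and re-rank the n-best list, returning the best hypothesis.
    def combined(hyp):
        return hyp["asr_logprob"] + weight * llm_score(hyp["text"])
    return max(nbest, key=combined)

In practice the weight would be tuned on a development set; discriminative training, as explored in the paper, optimizes the scorer itself rather than this fixed interpolation.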
Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition
Yu, Yu, Yang, Chao-Han Huck, Dinh, Tuan, Ryu, Sungho, Kolehmainen, Jari, Ren, Roger, Filimonov, Denis, Shivakumar, Prashanth G., Gandhe, Ankur, Rastrow, Ariya, Xu, Jia, Bulyko, Ivan, Stolcke, Andreas
The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become an increasingly popular, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public LibriSpeech dataset and 3.67% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements, and we introduce a novel metric, N-best Perturbation-based Rescoring Robustness (NPRR), to measure the relative degradation in rescoring model performance. Our experimental results indicate that while advanced LoRA variants, such as dynamic rank-allocated LoRA, lead to performance degradation under 1-best perturbation, they alleviate the degradation under N-best perturbation relative to fully fine-tuned models and vanilla LoRA tuning baselines. This suggests that the adaptation method must be selected carefully when using LoRA-based adaptation for compute-cost savings and robust language modeling.
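One plausible reading of the robustness measurement, with the caveat that the paper's exact NPRR definition may differ from this illustrative aggregation:

def relative_degradation(wer_clean, wer_perturbed):
    # Relative performance degradation of a rescoring model when its
    # input hypotheses are perturbed (e.g., by homophone replacement).
    return (wer_perturbed - wer_clean) / wer_clean

def nprr(wer_clean, wers_perturbed):
    # Hypothetical aggregation: average degradation over a set of
    # perturbed N-best lists.
    return sum(relative_degradation(wer_clean, w)
               for w in wers_perturbed) / len(wers_perturbed)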
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
Lin, Guan-Ting, Shivakumar, Prashanth Gurunath, Gandhe, Ankur, Yang, Chao-Han Huck, Gu, Yile, Ghosh, Shalini, Stolcke, Andreas, Lee, Hung-yi, Bulyko, Ivan
Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning. We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
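A minimal sketch of the serialized multitask ordering described above; the bracketed tag tokens and function name are hypothetical placeholders, not the paper's actual vocabulary:

def build_training_sequence(context, cur_sentiment, resp_sentiment, resp_text):
    # Concatenate targets in the serialized order so each prediction
    # conditions autoregressively on the previous ones: current
    # sentiment, then response sentiment, then response text.
    return (f"{context} [CUR_SENT] {cur_sentiment} "
            f"[RESP_SENT] {resp_sentiment} [RESP_TEXT] {resp_text}")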
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
Everson, Kevin, Gu, Yile, Yang, Huck, Shivakumar, Prashanth Gurunath, Lin, Guan-Ting, Kolehmainen, Jari, Bulyko, Ivan, Gandhe, Ankur, Ghosh, Shalini, Hamza, Wael, Lee, Hung-yi, Rastrow, Ariya, Stolcke, Andreas
In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automatic speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts with the help of word confusion networks derived from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we examine the LLM's robustness under varying ASR performance conditions and analyze which aspects of in-context learning prove most influential.
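To make the idea concrete, here is a minimal sketch of rendering a word confusion network into an LLM prompt; the particular textual format is an assumption, not the paper's:

def wcn_to_prompt(wcn):
    # A word confusion network is a sequence of slots, each holding
    # (word, posterior) alternatives collapsed from the ASR lattice.
    # Rendering alternatives inline exposes recognition ambiguity
    # directly to the LLM.
    return " ".join(
        "/".join(f"{word}({prob:.2f})" for word, prob in slot)
        for slot in wcn
    )

wcn = [[("play", 0.6), ("pray", 0.4)], [("jazz", 0.9), ("chess", 0.1)]]
print(wcn_to_prompt(wcn))  # play(0.60)/pray(0.40) jazz(0.90)/chess(0.10)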
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
Yu, Yu, Yang, Chao-Han Huck, Kolehmainen, Jari, Shivakumar, Prashanth G., Gu, Yile, Ryu, Sungho, Ren, Roger, Luo, Qi, Gourav, Aditya, Chen, I-Fan, Liu, Yi-Chieh, Dinh, Tuan, Gandhe, Ankur, Filimonov, Denis, Ghosh, Shalini, Stolcke, Andreas, Rastrow, Ariya, Bulyko, Ivan
We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limits their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation RescoreBERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6.
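A minimal PyTorch sketch of the LoRA idea used here (freeze the pretrained weight, learn an additive low-rank update); the rank, scaling, and initialization are illustrative defaults, not the paper's settings:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a pretrained linear layer: the base weight W stays frozen,
    # and only the low-rank pair (lora_a, lora_b) is trained, acting as
    # a low-rank decomposition of the weight update.
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

Wrapping, say, each attention projection of a rescoring BERT this way leaves the base weights untouched; only lora_a and lora_b receive gradients, which is how the trainable-parameter fraction stays as small as the 0.08% figure above.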
Personalization for BERT-based Discriminative Speech Recognition Rescoring
Kolehmainen, Jari, Gu, Yile, Gourav, Aditya, Shivakumar, Prashanth Gurunath, Gandhe, Ankur, Rastrow, Ariya, Bulyko, Ivan
Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant, supplemented with personalized named entities, to compare these approaches. On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10% relative to a neural rescoring baseline. We also show that on this test set, natural language prompts can improve word error rate by 7% without any training and with only a marginal loss in generalization. Overall, gazetteers perform best, with a 10% improvement in word error rate (WER) while also improving WER on a general test set by 1%.
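As a toy illustration of the gazetteer idea (a simplified additive stand-in for the gazetteer features fed to the neural rescorer; all names here are hypothetical):

def personalized_score(hypothesis, base_score, gazetteer, bonus=0.5):
    # Reward hypotheses that contain the user's personalized entities
    # by adding a fixed bonus per matched gazetteer entry to the
    # rescorer's score.
    text = hypothesis.lower()
    hits = sum(1 for entity in gazetteer if entity.lower() in text)
    return base_score + bonus * hits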
Streaming Speech-to-Confusion Network Speech Recognition
Filimonov, Denis, Pandey, Prabhat, Rastrow, Ariya, Gandhe, Ankur, Stolcke, Andreas
In interactive automatic speech recognition (ASR) systems, low-latency requirements limit the amount of search space that can be explored during decoding, particularly in end-to-end neural ASR. In this paper, we present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency, as needed for interactive applications. We show that 1-best results of our model are on par with a comparable RNN-T system, while the richer hypothesis set allows second-pass rescoring to achieve 10-20% lower word error rate on the LibriSpeech task. We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
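A minimal sketch of the confusion-network output structure and why it helps: the 1-best path is a per-slot argmax, while a rescorer can explore the full slot alternatives. The dict-of-posteriors representation is an assumption for illustration:

def one_best(confusion_network):
    # Each slot maps candidate words to posteriors; the 1-best result
    # takes the argmax per slot, whereas second-pass rescoring can
    # search combinations across slots.
    return [max(slot, key=slot.get) for slot in confusion_network]

cn = [{"turn": 0.7, "turned": 0.3}, {"off": 0.55, "of": 0.45}, {"lights": 1.0}]
print(one_best(cn))  # ['turn', 'off', 'lights']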
Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition
Fu, Xuandi, Sathyendra, Kanthashree Mysore, Gandhe, Ankur, Liu, Jing, Strimel, Grant P., McGowan, Ross, Mouchtaris, Athanasios
Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare-words in End-to-End Automatic Speech Recognition (E2E ASR) systems like neural transducers. These approaches employ cross-attention to bias the model towards specific contextual entities injected as bias-phrases to the model. Prior approaches typically relied on subword encoders for encoding the bias phrases. However, subword tokenizations are coarse and fail to capture granular pronunciation information, which is crucial for biasing based on acoustic similarity. In this work, we propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing guided by acoustic similarity between the audio and the contextual entities (termed acoustic biasing). We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context along with contextual entities to perform biasing informed by the utterance's semantic context (termed semantic biasing). Experiments using a Conformer Transducer model on the LibriSpeech dataset show a 4.62%-9.26% relative WER improvement across different biasing list sizes over the baseline contextual model when incorporating our proposed acoustic and semantic biasing approach. On a large-scale in-house dataset, we observe a 7.91% relative WER improvement compared to our baseline model. On tail utterances, the improvements are even more pronounced, with 36.80% and 23.40% relative WER improvements on LibriSpeech rare words and an in-house test set, respectively.
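A minimal PyTorch sketch of cross-attention biasing, assuming audio-frame encodings as queries and encoded bias phrases as keys/values; the dimensions and residual combination are illustrative, not the paper's architecture:

import torch.nn as nn

class ContextualBiasing(nn.Module):
    # Attend from the acoustic stream to the encoded bias phrases and
    # add the attended context back in. For acoustic biasing, bias_enc
    # could come from a character-level encoder; for semantic biasing,
    # from a pretrained NLM encoder.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_enc, bias_enc):
        biased, _ = self.attn(audio_enc, bias_enc, bias_enc)
        return audio_enc + biased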
On-the-fly Text Retrieval for End-to-End ASR Adaptation
Yusuf, Bolaji, Gourav, Aditya, Gandhe, Ankur, Bulyko, Ivan
End-to-end speech recognition models are improved by incorporating external text sources, typically by fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves from an external text corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
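A toy sketch of the retrieval step, where simple prefix matching stands in for the learned retriever and the adapter that fuses completions into the transducer's predictions is omitted:

def retrieve_completions(partial_hypothesis, corpus, k=3):
    # Find corpus lines that extend the partial ASR hypothesis and
    # return their continuations as candidate completions. Swapping
    # the corpus changes the retrievable vocabulary with no retraining.
    prefix = partial_hypothesis.lower()
    matches = (line for line in corpus if line.lower().startswith(prefix))
    return [line[len(partial_hypothesis):].strip() for line in matches][:k]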