Ren, Roger
Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition
Yu, Yu, Yang, Chao-Han Huck, Dinh, Tuan, Ryu, Sungho, Kolehmainen, Jari, Ren, Roger, Filimonov, Denis, Shivakumar, Prashanth G., Gandhe, Ankur, Rastow, Ariya, Xu, Jia, Bulyko, Ivan, Stolcke, Andreas
The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasing popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50\% on the public Librispeech dataset and of 3.67\% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in $1$-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling.
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
Yu, Yu, Yang, Chao-Han Huck, Kolehmainen, Jari, Shivakumar, Prashanth G., Gu, Yile, Ryu, Sungho, Ren, Roger, Luo, Qi, Gourav, Aditya, Chen, I-Fan, Liu, Yi-Chieh, Dinh, Tuan, Gandhe, Ankur, Filimonov, Denis, Ghosh, Shalini, Stolcke, Andreas, Rastow, Ariya, Bulyko, Ivan
However, as the size of the pretrained models increases, the cost associated We propose a neural language modeling system based on with fine-tuning and deploying these models for low-rank adaptation (LoRA) for speech recognition output real-world applications also escalates. To address this practical rescoring. Although pretrained language models (LMs) challenge, a range of parameter-efficient methods (e.g., like BERT have shown superior performance in second-pass adapters, model reprogramming, and prompts) have been proposed rescoring, the high computational cost of scaling up the pretraining [11, 12, 13, 14, 15, 16, 17, 18] to alleviate the computation stage and adapting the pretrained models to specific and memory demands of fine-tuning LLMs. Low-rank domains limit their practical use in rescoring. Here we present adaptation (LoRA) [19] freezes all pretrained parameters in a method based on low-rank decomposition to train a rescoring the LLM and inserts a trainable pair of matrices (acting as a BERT model and adapt it to new domains using only a low-rank decomposition of a full matrix) additively into each fraction (0.08%) of the pretrained parameters. These inserted layer of the Transformer architecture. Compared to other matrices are optimized through a discriminative training objective parameter-efficient training methods, such as adapters [12], along with a correlation-based regularization loss. The LoRA has two distinct advantages: 1) it employs a simple proposed low-rank adaptation RescoreBERT (LoRB) architecture architecture and has the potential to reduce the number of is evaluated on LibriSpeech and internal datasets with trainable parameters compared to alternatives; 2) LoRA does decreased training times by factors between 5.4 and 3.6.
Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition
Yang, Chao-Han Huck, Ahmed, Zeeshan, Gu, Yile, Szurley, Joseph, Ren, Roger, Liu, Linda, Stolcke, Andreas, Bulyko, Ivan
In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without accessing gradient information of a targeted ASR model directly. We propose an advanced Bayesian neural network (BNN) based adversarial detector, which could model latent distributions against adaptive adversarial perturbation with divergence measurement. We further simulate deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR systems with the proposed adversarial detection system. Leveraging the proposed BNN based detection system, we improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets compared to the current model enhancement methods against the adversarial speech examples.