Goto

Collaborating Authors

 partial hypothesis


Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition

arXiv.org Artificial Intelligence

Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives "bonus scores" to partial hypothesis (e.g. "Bon") that may lead to the generation of the rare word (e.g. "Bonham"). If the full word ("Bonham") isn't ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.


Iterative Shallow Fusion of Backward Language Model for End-to-End Speech Recognition

arXiv.org Artificial Intelligence

We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR hypotheses in the backward direction (i.e., from the possible next token to the start symbol) during decoding, substituting the newly calculated BLM scores for the scores calculated at the last iteration. To enhance the effectiveness of this iterative SF (ISF), we train a partial sentence-aware BLM (PBLM) using reversed text data including partial sentences, considering the framework of ISF. In experiments using an attention-based encoder-decoder ASR system, we confirmed that ISF using the PBLM shows comparable performance with SF using the FLM. By performing ISF, early pruning of prospective hypotheses can be prevented during decoding, and we can obtain a performance improvement compared to applying the PBLM as post-processing. Finally, we confirmed that, by combining SF and ISF, further performance improvement can be obtained thanks to the complementarity of the FLM and PBLM.


Implementing contextual biasing in GPU decoder for online ASR

arXiv.org Artificial Intelligence

GPU decoding significantly accelerates the output of ASR predictions. While GPUs are already being used for online ASR decoding, post-processing and rescoring on GPUs have not been properly investigated yet. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring in decoding and biasing language model (LM) weights in offline and online CPU scenarios. In real-time GPU decoding, partial recognition hypotheses are produced without lattice generation, which makes the implementation of biasing more complex. The paper proposes and describes an approach to integrate contextual biasing in real-time GPU decoding while exploiting the standard Kaldi GPU decoder. Besides the biasing of partial ASR predictions, our approach also permits dynamic context switching allowing a flexible rescoring per each speech segment directly on GPU. The code is publicly released and tested with open-sourced test sets.


Personalized Predictive ASR for Latency Reduction in Voice Assistants

arXiv.org Artificial Intelligence

Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the tradeoff between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.


Common-Frame Model for Object Recognition

Neural Information Processing Systems

A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by draw- ing a set of objects from a given database, with random clutter sprinkled on the remaining image surface. We study the case where features from the same object share a common reference frame. Moreover, parameters for shape and appearance den- sities are shared across features.


Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI

arXiv.org Artificial Intelligence

Recently, End-to-End (E2E) frameworks have achieved remarkable results on various Automatic Speech Recognition (ASR) tasks. However, Lattice-Free Maximum Mutual Information (LF-MMI), as one of the discriminative training criteria that show superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate LF-MMI criterion into E2E ASR frameworks in both training and decoding stages. The proposed approach shows its effectiveness on two of the most widely used E2E frameworks including Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements on various datasets and different E2E ASR frameworks. The best of our models achieves competitive CER of 4.1\% / 4.4\% on Aishell-1 dev/test set; we also achieve significant error reduction on Aishell-2 and Librispeech datasets over strong baselines.


Search-based Methods to Bound Diagnostic Probabilities in Very Large Belief Nets

arXiv.org Artificial Intelligence

Since exact probabilistic inference is intractable in general for large multiply connected belief nets, approximate methods are required. A promising approach is to use heuristic search among hypotheses (instantiations of the network) to find the most probable ones, as in the TopN algorithm. Search is based on the relative probabilities of hypotheses which are efficient to compute. Given upper and lower bounds on the relative probability of partial hypotheses, it is possible to obtain bounds on the absolute probabilities of hypotheses. Best-first search aimed at reducing the maximum error progressively narrows the bounds as more hypotheses are examined. Here, qualitative probabilistic analysis is employed to obtain bounds on the relative probability of partial hypotheses for the BN20 class of networks networks and a generalization replacing the noisy OR assumption by negative synergy. The approach is illustrated by application to a very large belief network, QMR-BN, which is a reformulation of the Internist-1 system for diagnosis in internal medicine.


Evaluating and Improving Real-Time Tracking of Childrenโ€™s Oral Reading

AAAI Conferences

The accuracy of an automated reading tutor in tracking the readerโ€™s position is affected by phenomena at the frontier of the speech recognizerโ€™s output as it evolves in real time. We define metrics of real-time tracking accuracy computed from the recognizerโ€™s successive partial hypotheses, in contrast to previous metrics computed from the final hypothesis. We analyze the resulting considerable loss in real-time accuracy, and propose and evaluate a method to address it. Our method raises real-time accuracy from 58% to 70%, which should improve the quality of the tutorโ€™s feedback.


Common-Frame Model for Object Recognition

Neural Information Processing Systems

A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by drawing a set of objects from a given database, with random clutter sprinkled on the remaining image surface.


Common-Frame Model for Object Recognition

Neural Information Processing Systems

A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by drawing a set of objects from a given database, with random clutter sprinkled on the remaining image surface.