Peyser, Cal
Text Injection for Neural Contextual Biasing
Meng, Zhong, Wu, Zelin, Prabhavalkar, Rohit, Peyser, Cal, Wang, Weiran, Chen, Nanxin, Sainath, Tara N., Ramabhadran, Bhuvana
Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.
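A rough sense of the MWER objective that CTI-MWER extends can be given in a few lines of NumPy. This is only a generic sketch of minimum word error rate training over an N-best list, with illustrative function and argument names; the paper's contextual, text-injected variant computes the word errors under biasing with injected text.

```python
import numpy as np

def mwer_loss(nbest_log_probs, nbest_word_errors):
    """Generic MWER sketch: expected word errors over a renormalized N-best list.

    CTI-MWER (per the abstract) minimizes this kind of expectation for the
    errors introduced by contextual biasing when unpaired text is injected;
    the inputs here are illustrative stand-ins.
    """
    log_probs = np.asarray(nbest_log_probs, dtype=np.float64)
    errors = np.asarray(nbest_word_errors, dtype=np.float64)
    probs = np.exp(log_probs - log_probs.max())   # softmax over the N-best list
    probs /= probs.sum()
    baseline = errors.mean()                       # mean-error baseline for variance reduction
    return float(np.dot(probs, errors - baseline))

# Three hypotheses with model scores and word-error counts for one utterance.
print(mwer_loss([-1.2, -2.0, -3.5], [0, 2, 3]))
```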
Improving Joint Speech-Text Representations Without Alignment
Peyser, Cal, Meng, Zhong, Hu, Ke, Prabhavalkar, Rohit, Rosenberg, Andrew, Sainath, Tara N., Picheny, Michael, Cho, Kyunghyun
The last year has seen astonishing progress in text-prompted image generation premised on the idea of a cross-modal representation space in which the text and image domains are represented jointly. In ASR, this idea has found application as joint speech-text encoders that can scale to the capacities of very large parameter models by being trained on both unpaired speech and text. While these methods show promise, they have required special treatment of the sequence-length mismatch inherent in speech and text, either by up-sampling heuristics or an explicit alignment model. In this work, we offer evidence that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length, and argue that consistency losses could forgive length differences and simply assume the best alignment. We show that such a loss improves downstream WER in both a large-parameter monolingual and multilingual system.
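To make "forgiving length differences and assuming the best alignment" concrete, here is a toy length-agnostic consistency loss in NumPy. It matches each text-token embedding to its nearest speech-frame embedding, so no upsampling heuristic or alignment model is required; the loss form and names are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def best_alignment_consistency(speech_frames, text_tokens):
    """Toy consistency loss that ignores the speech/text length mismatch.

    Each text-token embedding is charged only for its closest speech frame
    ("assume the best alignment"); shapes are (T_speech, D) and (T_text, D).
    """
    speech = np.asarray(speech_frames, dtype=np.float64)
    text = np.asarray(text_tokens, dtype=np.float64)
    diffs = text[:, None, :] - speech[None, :, :]      # (T_text, T_speech, D)
    dists = (diffs ** 2).sum(axis=-1)                   # pairwise squared distances
    return float(dists.min(axis=1).mean())              # nearest-frame match per token

# Mismatched lengths: 8 encoded speech frames vs. 3 encoded text tokens.
rng = np.random.default_rng(0)
print(best_alignment_consistency(rng.normal(size=(8, 4)), rng.normal(size=(3, 4))))
```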
A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale
Peyser, Cal, Picheny, Michael, Cho, Kyunghyun, Prabhavalkar, Rohit, Huang, Ronny, Sainath, Tara
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus. However, little guidance exists on deploying these methods to improve production ASR systems that are trained on very large supervised corpora and with realistic requirements like a constrained model size and CPU budget, streaming capability, and a rich lattice for rescoring and for downstream NLU tasks. In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting using joint training. We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.

Unlike previous work, we apply these methods to a state-of-the-art, 160M-parameter streaming Conformer [7] model that is already trained on a very large supervised corpus. We further depart from previous work by training supervised and unsupervised tasks jointly, which is being increasingly shown to be preferable to the conventional fine-tuning approach on very large datasets [8]. We find that under these conditions, none of the studied methods improve general WER at all. However, we report improvements in the decoder's computational load and in lattice density, as well as in several targeted WER measurements assessing performance on known categories of particularly difficult utterances. Through this comparison and analysis, we hope to offer a more nuanced and comprehensive view of the usefulness of unpaired audio and text in industrial ASR.
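The "joint training" point can be stated as a one-line objective: every optimizer step mixes the supervised ASR loss with the unpaired-text and unpaired-audio losses, rather than fine-tuning on them in a separate stage. The weights below are illustrative placeholders, not values from the paper.

```python
def joint_semi_supervised_loss(supervised_loss, text_injection_loss, audio_ssl_loss,
                               w_text=0.3, w_audio=0.3):
    """Sketch of a jointly trained objective: supervised and unsupervised
    terms are optimized in the same step (weights are illustrative)."""
    return supervised_loss + w_text * text_injection_loss + w_audio * audio_ssl_loss

# Example step: the three per-batch losses would come from the shared model.
print(joint_semi_supervised_loss(0.82, 1.10, 0.95))
```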
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
Huang, W. Ronny, Chang, Shuo-Yiin, Sainath, Tara N., He, Yanzhang, Rybach, David, David, Robert, Prabhavalkar, Rohit, Allauzen, Cyril, Peyser, Cal, Strohman, Trevor D.
We explore unifying a neural segmenter with two-pass cascaded encoder ASR into a single model. A key challenge is allowing the segmenter (which runs in real-time, synchronously with the decoder) to finalize the 2nd pass (which runs 900 ms behind real-time) without introducing user-perceived latency or deletion errors during inference. We propose a design where the neural segmenter is integrated with the causal 1st pass decoder to emit an end-of-segment (EOS) signal in real-time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass, and find that a novel dummy frame injection strategy allows for high-quality 2nd pass results and low finalization latency simultaneously. On a real-world long-form captioning task (YouTube), we achieve 2.4% relative WER and 140 ms EOS latency gains over a baseline VAD-based segmenter with the same cascaded encoder.
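The dummy-frame idea can be sketched in a few lines: once the causal 1st pass emits EOS, the segment is padded with synthetic right-context frames so the non-causal 2nd pass can finalize immediately instead of waiting for ~900 ms of real audio. The frame counts, feature dimension, and zero-valued padding are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def finalize_segment_with_dummy_frames(features, eos_frame, right_context_frames=90,
                                       feature_dim=80):
    """Pad a finished segment with dummy frames so a non-causal 2nd pass
    can run without waiting for future audio (all sizes are illustrative)."""
    segment = np.asarray(features[: eos_frame + 1], dtype=np.float32)
    dummy = np.zeros((right_context_frames, feature_dim), dtype=np.float32)
    return np.concatenate([segment, dummy], axis=0)

# 300 frames of 80-dim features; the 1st pass declares EOS at frame 249.
features = np.random.randn(300, 80).astype(np.float32)
padded = finalize_segment_with_dummy_frames(features, eos_frame=249)
print(padded.shape)  # (250 + 90, 80)
```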
Dual Learning for Large Vocabulary On-Device ASR
Peyser, Cal, Huang, Ronny, Sainath, Tara, Prabhavalkar, Rohit, Picheny, Michael, Cho, Kyunghyun
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address only the case of using unpaired examples to compensate for very small supervised datasets, and mostly on large, non-streaming models. Dual learning has not yet been proven effective for using unsupervised data to improve realistic on-device streaming models that are already trained on large supervised corpora. We provide this missing piece through an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
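A skeletal dual-learning step looks like the following; `asr` and `tts` are stand-in objects with `transcribe`/`synthesize` and `train_on` methods, and the pairing of pseudo-labels with unlabeled examples is the whole mechanism. This is a hedged outline of the paradigm, not the paper's training code.

```python
def dual_learning_step(asr, tts, unpaired_audio, unpaired_text):
    """One round of dual learning: each model pseudo-labels data for the other.

    `asr` and `tts` are hypothetical interfaces (transcribe/synthesize plus
    train_on); in the paper's setting the ASR side would be the streaming
    conformer and the unlabeled data would come from Librispeech.
    """
    pseudo_text = [asr.transcribe(audio) for audio in unpaired_audio]
    pseudo_audio = [tts.synthesize(text) for text in unpaired_text]
    # Each model trains on pairs labeled by its dual task.
    tts.train_on(list(zip(pseudo_text, unpaired_audio)))
    asr.train_on(list(zip(pseudo_audio, unpaired_text)))
```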
Improving Performance of End-to-End ASR on Numeric Sequences
Peyser, Cal, Zhang, Hao, Sainath, Tara N., Wu, Zelin
Recognizing written domain numeric utterances (e.g. I need $1.25.) can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g. I need one dollar and twenty five cents.), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional numeric training data, as well as using a small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, we see WER reduced by up to a factor of 8.
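The spoken-to-written "denorming" task that the small neural network learns (in place of an FST verbalizer) is easiest to see from example pairs like the one in the abstract; the second pair below is an invented illustration of the same mapping.

```python
# Spoken-domain ASR output -> written-domain target: the mapping learned by
# the small-footprint neural denormer. Pairs are illustrative examples only.
denorm_pairs = [
    ("i need one dollar and twenty five cents", "i need $1.25"),
    ("the total is three hundred dollars",      "the total is $300"),
]

for spoken, written in denorm_pairs:
    print(f"{spoken} -> {written}")
```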