
Neural Information Processing Systems

We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization.
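To make the local-versus-global normalization distinction concrete, here is a toy sketch (our own illustration, not from the paper): with per-step scores over a tiny vocabulary, a locally normalized model applies a softmax at each step, while a globally normalized model divides one sequence-level numerator by a partition function over all label sequences. Because these toy scores decompose per step, the denominator factorizes into per-step sums, mirroring the kind of tractable exact computation the abstract refers to.

```python
import math
from itertools import product

# Toy scores: scores[t][v] = unnormalized score for emitting label v at step t.
# A vocabulary of 2 labels over 3 steps; the numbers are illustrative.
scores = [
    [2.0, 0.5],
    [1.0, 1.5],
    [0.2, 2.2],
]

def local_log_prob(seq):
    """Locally normalized: a softmax at each step, probabilities multiply."""
    lp = 0.0
    for t, v in enumerate(seq):
        z = math.log(sum(math.exp(s) for s in scores[t]))
        lp += scores[t][v] - z
    return lp

def global_log_prob(seq):
    """Globally normalized: one sequence-level partition function."""
    num = sum(scores[t][v] for t, v in enumerate(seq))
    # Exact denominator: log-sum-exp over all label sequences; it factorizes
    # here only because the toy score decomposes per step.
    den = sum(math.log(sum(math.exp(s) for s in scores[t]))
              for t in range(len(scores)))
    return num - den

seq = (0, 1, 1)
# For per-step-decomposable scores the two coincide; label bias appears once
# scores are conditioned on streaming context and local normalization is forced.
assert abs(local_log_prob(seq) - global_log_prob(seq)) < 1e-9
```

The globally normalized probabilities still sum to one over all label sequences, which is what makes the exact denominator meaningful.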


Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Neural Information Processing Systems

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention: it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
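The decoding loop described above can be sketched in a few lines. This is a hypothetical minimal version (random stand-in weights, our own names such as `EOS`, `W`, `E`): the decoder scans already-aligned encoder frames strictly in order, combines each frame with the embedding of the previous token (the text-only recurrence), and emits one token per frame until it predicts end-of-message.

```python
import numpy as np

EOS = 0
vocab, dim, T = 8, 16, 6
rng = np.random.default_rng(0)
W = rng.normal(size=(dim, vocab))   # stand-in for the prediction head
E = rng.normal(size=(vocab, dim))   # token embeddings for the text-only recurrence
frames = rng.normal(size=(T, dim))  # encoder output, assumed already aligned

def decode(frames):
    tokens, prev = [], EOS          # reusing EOS as the start token is our assumption
    for frame in frames:            # scan embedding frames strictly in order
        logits = (frame + E[prev]) @ W   # frame-wise prediction, no cross-attention
        tok = int(np.argmax(logits))
        if tok == EOS:              # stop at end-of-message
            break
        tokens.append(tok)
        prev = tok
    return tokens

hyp = decode(frames)
```

Because there is exactly one prediction per frame and no alignment lattice, both training (frame-wise cross-entropy) and inference stay linear in the number of frames.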


WST: Weakly Supervised Transducer for Automatic Speech Recognition

Gao, Dongji, Liao, Chenda, Liu, Changliang, Wiesner, Matthew, Garcia, Leibny Paola, Povey, Daniel, Khudanpur, Sanjeev, Wu, Jian

arXiv.org Artificial Intelligence

The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
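The "flexible training graph" idea can be pictured with a toy sketch. The arc list below is our own illustration, not the paper's actual graph topology: a linear graph over the (possibly erroneous) transcript is augmented with wildcard arcs that tolerate mistranscribed tokens and self-loops that absorb spoken-but-untranscribed words.

```python
# Hypothetical sketch of a WST-style flexible training graph.
WILDCARD = "*"  # matches any label during loss computation (our convention)

def build_flexible_graph(transcript):
    arcs = []                               # (src_state, dst_state, label)
    for i, tok in enumerate(transcript):
        arcs.append((i, i + 1, tok))        # take the token as written
        arcs.append((i, i + 1, WILDCARD))   # token may be mistranscribed
        arcs.append((i, i, WILDCARD))       # extra spoken words: self-loop
    n = len(transcript)
    arcs.append((n, n, WILDCARD))           # trailing untranscribed speech
    return arcs, n                          # final state is n

arcs, final = build_flexible_graph(["the", "cat", "sat"])
```

Training then marginalizes over all paths through such a graph, so a corrupted token never forces the model to fit a wrong label exactly.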



TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree

Andrusenko, Andrei, Bataev, Vladimir, Grigoryan, Lilit, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches require additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducer, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables its use in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The results show the high efficiency of the proposed method, which surpasses the considered open-source context-biasing approaches in both accuracy and decoding speed. Our context-biasing framework is open-sourced as part of the NeMo toolkit. Modern end-to-end automatic speech recognition (ASR) systems, such as Connectionist Temporal Classification (CTC) [1], Recurrent Neural Transducer (RNN-T) [2], and Attention Encoder-Decoder (AED) [3], already achieve relatively high speech recognition accuracy in common data domains [4].
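The core data structure can be sketched as a trie of key phrases with a per-token bonus. This is a hypothetical CPU illustration of the idea only (class name, bonus value, and restart policy are ours); the framework's actual implementation is GPU-batched.

```python
# Hypothetical sketch of a phrase-boosting tree for shallow fusion: each
# decoding hypothesis tracks its current trie node and collects a score
# bonus while it stays on a boosted phrase path.
class BoostTree:
    def __init__(self, bonus=2.0):
        self.bonus = bonus
        self.root = {}

    def add_phrase(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def step(self, node, token):
        """Advance from `node` on `token`; return (next_node, score_delta)."""
        node = node if node is not None else self.root
        if token in node:
            return node[token], self.bonus       # continue a boosted phrase
        if token in self.root:
            return self.root[token], self.bonus  # restart on a phrase start
        return None, 0.0                         # off-tree: no boost

tree = BoostTree()
tree.add_phrase(["nvi", "dia"])
node, delta = tree.step(None, "nvi")  # enters the phrase and gets boosted
```

During shallow fusion, `score_delta` is simply added to the hypothesis score alongside the acoustic model's log-probability, so the mechanism is decoder-agnostic (greedy or beam, CTC or Transducer or AED).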


Pushing the Limits of Beam Search Decoding for Transducer-based ASR models

Grigoryan, Lilit, Bataev, Vladimir, Andrusenko, Andrei, Xu, Hainan, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves a 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion in low-resource settings by up to 11% compared to existing implementations. All the algorithms are open-sourced.
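The tree-based hypothesis structure mentioned above can be sketched as follows. This is a hypothetical illustration (field names are ours, not from the paper's code): instead of copying a full token list per beam entry, each hypothesis stores only its last token and a parent pointer, so common prefixes are shared and extending a hypothesis is O(1).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hyp:
    token: Optional[int]        # last emitted token (None for the empty root)
    parent: Optional["Hyp"]     # shared-prefix link into the hypothesis tree
    score: float                # cumulative log-probability

def extend(h, token, logp):
    """O(1) extension: no copying of the token history."""
    return Hyp(token=token, parent=h, score=h.score + logp)

def tokens(h):
    """Materialize the token sequence only when a final result is needed."""
    out = []
    while h.parent is not None:
        out.append(h.token)
        h = h.parent
    return out[::-1]

root = Hyp(None, None, 0.0)
a = extend(root, 7, -0.1)
b = extend(a, 3, -0.2)  # shares the prefix [7] with any sibling of `a`
```

Deferring materialization to the end of decoding is what makes this layout friendly to batched GPU execution: per-step work on each beam entry touches only fixed-size fields.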


RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

Bataev, Vladimir

arXiv.org Artificial Intelligence

We start with the template "CM3015 Machine Learning and Neural Networks, Theme 1: Deep Learning on a public dataset," which describes the task of choosing a publicly available dataset and training a deep learning model on it. So, we will work with a neural network-based end-to-end ASR system, using the LibriSpeech [2] dataset, a popular academic benchmark. We limit our task to RNN-Transducer [3] systems, which are widely used in production and provide state-of-the-art quality [4] in most cases. We go beyond the standard task and focus our research on making RNN-Transducer systems robust to noisy targets: unlike well-curated datasets, training data in industry contains various errors due to the unreliability of transcription sources or the inability to transcribe noisy speech accurately. To address the problem of training on noisy data, we analyze the impact of different types of errors in training data on the quality of the RNN-Transducer system and explore different loss modifications to overcome the problem. We construct artificial training data by mutating correct transcripts from the LibriSpeech [2] training set, similar to approaches used in related work, and try to achieve the best possible quality on the standard LibriSpeech development and test data.
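The transcript-mutation setup described above can be sketched in a few lines. This is a minimal illustration under our own assumptions (rates, dummy vocabulary, and function name are ours): a correct transcript is corrupted with substitutions, deletions, and insertions at given rates to produce noisy training targets.

```python
import random

def mutate(words, sub=0.1, dele=0.1, ins=0.1, vocab=("foo", "bar"), seed=0):
    """Corrupt a word list with substitution/deletion/insertion errors."""
    rng = random.Random(seed)
    out = []
    for w in words:
        r = rng.random()
        if r < dele:
            continue                       # deletion: drop the word
        if r < dele + sub:
            out.append(rng.choice(vocab))  # substitution: replace with a wrong word
        else:
            out.append(w)                  # keep the word as-is
        if rng.random() < ins:
            out.append(rng.choice(vocab))  # insertion: add an extra word
    return out

noisy = mutate("the quick brown fox".split(), sub=0.3, dele=0.2, ins=0.2)
```

Sweeping the three rates independently is what allows measuring how each error type degrades RNN-Transducer training.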


Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Stooke, Adam, Prabhavalkar, Rohit, Sim, Khe Chai, Mengibar, Pedro Moreno

arXiv.org Artificial Intelligence

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".