cass-nat
UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
Fan, Ruchao, Shankar, Natarajan Balaji, Alwan, Abeer
Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g., connectionist temporal classification (CTC), can be initialized from speech foundation models (SFM) but does not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR, like the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but is not able to efficiently integrate SFM. Inspired by the success of recent work on speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation of the speech signal and the token-level acoustic embedding is used as the input for the second pass. Examined on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better than or comparable to CASS-NAT with only an encoder and, hence, fewer model parameters. Our code is publicly available.
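The two-pass data flow described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a stand-in for the shared encoder, `pair_average` is a hypothetical placeholder for the token-level acoustic embedding extractor, the concatenation is assumed to be along the time axis, and reading the token outputs from the appended positions is likewise an assumption.

```python
def toy_encoder(seq):
    """Stand-in for the shared encoder (assumption): mixes each vector
    with the mean of the whole sequence so positions can interact."""
    dim = len(seq[0])
    mean = [sum(v[d] for v in seq) / len(seq) for d in range(dim)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dim)] for v in seq]


def pair_average(hidden):
    """Hypothetical extractor: one embedding per pair of frames.
    It only stands in for the real token-level acoustic embedding step."""
    out = []
    for i in range(0, len(hidden) - 1, 2):
        a, b = hidden[i], hidden[i + 1]
        out.append([(x + y) / 2 for x, y in zip(a, b)])
    return out


def two_pass_forward(speech_feats, extract_tae, encoder=toy_encoder):
    """Run the same encoder twice, as the abstract describes."""
    # Pass 1: the encoder sees only the speech input.
    hidden = encoder(speech_feats)
    # Token-level acoustic embeddings derived from the first-pass output.
    tae = extract_tae(hidden)
    # Pass 2: speech input concatenated with the embeddings
    # (time-axis concatenation is an assumption of this sketch).
    joint = encoder(speech_feats + tae)
    # Assumption: the appended positions carry the token-level states.
    token_states = joint[len(speech_feats):]
    return hidden, token_states


speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
hidden, token_states = two_pass_forward(speech, pair_average)
```

Because both passes call the same `encoder`, the sketch has no separate decoder parameters, which mirrors the parameter saving the abstract mentions.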
A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition
Fan, Ruchao, Chu, Wei, Chang, Peng, Alwan, Abeer
Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs with the acoustical boundary information offered by the CTC alignment. TAE can be obtained in parallel, resulting in a parallel generation of output tokens. During training, Viterbi-alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch in the training and testing processes. Experimental results show that the CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.
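As a rough illustration of how a CTC alignment can supply acoustic boundaries for token-level acoustic embeddings, the sketch below splits a frame-level alignment into one segment per emitted token and pools the encoder frames inside each segment. Several things here are assumptions rather than details from the abstract: the blank id `0`, the boundary convention (each token owns the blanks before it plus its own repeated frames, and trailing blanks go to the last token), and mean pooling, which is swapped in as a simple stand-in for whatever extraction module the paper uses. The names `token_segments` and `mean_pool_tae` are hypothetical.

```python
def token_segments(alignment, blank=0):
    """Split a frame-level CTC alignment into (token, start, end) spans,
    with `end` exclusive. Assumed convention: a token's span covers the
    blanks preceding it and its run of repeated labels."""
    segments = []
    seg_start = 0
    t = 0
    n = len(alignment)
    while t < n:
        label = alignment[t]
        if label == blank:
            t += 1
            continue
        # Extend over the run of identical non-blank labels.
        run_end = t
        while run_end < n and alignment[run_end] == label:
            run_end += 1
        segments.append((label, seg_start, run_end))
        seg_start = run_end
        t = run_end
    # Trailing blanks are attached to the last token's span.
    if segments and seg_start < n:
        tok, start, _ = segments[-1]
        segments[-1] = (tok, start, n)
    return segments


def mean_pool_tae(encoder_out, segments):
    """One embedding per token: the mean of the encoder frames in its span.
    Every span is independent, so all tokens could be computed in parallel."""
    taes = []
    for _, start, end in segments:
        frames = encoder_out[start:end]
        dim = len(frames[0])
        taes.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return taes


alignment = [0, 3, 3, 0, 5, 0, 0, 7, 0]            # 9 frames, 3 tokens
encoder_out = [[float(t), 1.0] for t in range(9)]  # toy 2-dim frames
segments = token_segments(alignment)               # [(3, 0, 3), (5, 3, 5), (7, 5, 9)]
taes = mean_pool_tae(encoder_out, segments)        # [[1.0, 1.0], [3.5, 1.0], [6.5, 1.0]]
```

The number of spans fixes the output length before any token is predicted, which is what allows all output tokens to be generated in a single parallel step.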
An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition
Fan, Ruchao, Chu, Wei, Chang, Peng, Xiao, Jing, Alwan, Abeer
Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments.
In addition, Fujita et al. used the idea of the insertion transformer from NMT to generate the output sequence with an arbitrary order [12]. Another recent effective method is using multiple decoders as refiners to do an iterative refinement based on CTC alignments [14]. Theoretically, the iterative NAT has a limited improvement of inference speed since multiple iterations are still needed to obtain a competitive result. In contrast, single step NAT, which attempts to generate the output sequence with only one iteration, can have a better speed up for inference. The idea is to substitute the word embedding in autoregressive models with an acoustic representation for each output token, assuming that language semantics can also be captured