AITopics | cass-nat

cass-nat

UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models

Fan, Ruchao, Shanka, Natarajan Balaji, Alwan, Abeer

arXiv.org Artificial Intelligence, Feb-13-2024

Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR models, e.g. connectionist temporal classification (CTC), can be initialized from speech foundation models (SFM) but do not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR models, like the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the success of recent work on speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR model, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation of the speech signal and the token-level acoustic embedding is used as the input for the second pass. Evaluated on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better than or comparable to CASS-NAT while using only an encoder and hence fewer model parameters. Our code is publicly available.
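The two-pass data flow described in this abstract can be illustrated with a minimal sketch. Everything here is hypothetical: `two_pass_encode`, `toy_encoder`, and `toy_extract_tae` are illustrative names, not the authors' code. The sketch assumes that the concatenation is along the sequence (time) axis and that the second-pass outputs at the appended positions are the ones used for token prediction; the abstract states neither detail.

```python
from typing import Callable, List

Vector = List[float]
Seq = List[Vector]


def two_pass_encode(
    encoder: Callable[[Seq], Seq],
    speech: Seq,
    extract_tae: Callable[[Seq], Seq],
) -> Seq:
    """Run one shared encoder twice (assumed data flow, not the paper's code)."""
    # Pass 1: the encoder sees the speech frames only.
    first_pass = encoder(speech)
    # Token-level acoustic embeddings (TAE) derived from the first-pass outputs.
    tae = extract_tae(first_pass)
    # Pass 2: the same encoder sees speech frames followed by the TAEs
    # (assumption: concatenation along the sequence axis).
    second_pass = encoder(speech + tae)
    # Assumption: the outputs at the appended positions represent the tokens.
    return second_pass[len(speech):]


def toy_encoder(seq: Seq) -> Seq:
    """Stand-in for a speech foundation model: doubles every value."""
    return [[2.0 * v for v in vec] for vec in seq]


def toy_extract_tae(enc_out: Seq) -> Seq:
    """Stand-in for TAE extraction: average each pair of adjacent frames."""
    return [
        [(a + b) / 2.0 for a, b in zip(enc_out[i], enc_out[i + 1])]
        for i in range(0, len(enc_out) - 1, 2)
    ]


speech = [[1.0], [3.0], [5.0], [7.0]]
token_states = two_pass_encode(toy_encoder, speech, toy_extract_tae)
print(token_states)
```

The toy encoder keeps the sequence length unchanged, which is a simplification; a real speech encoder typically subsamples its input.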

cass-nat, encoder, unienc-cassnat, (16 more...)

doi: 10.1109/LSP.2024.3365036

2402.08898

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (0.40)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Fan, Ruchao, Chu, Wei, Chang, Peng, Alwan, Abeer

arXiv.org Artificial Intelligence, Apr-15-2023

Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs using the acoustic boundary information offered by the CTC alignment. TAEs can be obtained in parallel, resulting in parallel generation of output tokens. During training, Viterbi alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between training and testing. Experimental results show that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.
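To make the idea of token-level acoustic embeddings concrete, here is a rough sketch of how a frame-level CTC alignment can be turned into per-token frame spans and then into one vector per token. It is not the paper's method: mean pooling is swapped in for whatever extraction the CASS-NAT decoder actually performs, the boundary convention (blank frames attach to the following token, trailing blanks to the last token) is an assumption, and the names `BLANK`, `collapse_alignment`, `token_segments`, and `mean_pool` are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

BLANK = 0  # assumed id of the CTC blank symbol


def collapse_alignment(alignment: List[int], blank: int = BLANK) -> List[int]:
    """Standard CTC collapse: merge repeated labels, then drop blanks."""
    return [tok for tok, _ in groupby(alignment) if tok != blank]


def token_segments(
    alignment: List[int], blank: int = BLANK
) -> List[Tuple[int, int, int]]:
    """Return (token, start, end) frame spans, end exclusive, one per token.

    Assumed convention: blank frames belong to the token that follows them,
    and trailing blanks belong to the last token.
    """
    segments = []
    start = 0
    pos = 0
    for tok, run in groupby(alignment):
        pos += len(list(run))
        if tok != blank:
            segments.append((tok, start, pos))
            start = pos
    if segments and start < len(alignment):
        tok, seg_start, _ = segments[-1]
        segments[-1] = (tok, seg_start, len(alignment))
    return segments


def mean_pool(
    frames: List[List[float]], segments: List[Tuple[int, int, int]]
) -> List[List[float]]:
    """One vector per token: the average of the encoder frames in its span."""
    dim = len(frames[0])
    return [
        [sum(frames[t][d] for t in range(s, e)) / (e - s) for d in range(dim)]
        for _, s, e in segments
    ]


alignment = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3, 0]
frames = [[float(t), 1.0] for t in range(len(alignment))]  # toy encoder outputs

tokens = collapse_alignment(alignment)
segments = token_segments(alignment)
embeddings = mean_pool(frames, segments)
print(tokens, segments, embeddings)
```

Each token's span is computed independently of the others, so all embeddings can be produced at once, which is what allows the output tokens to be generated in parallel.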

artificial intelligence, machine learning, natural language, (20 more...)

doi: 10.1109/TASLP.2023.3263789

2304.07611

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Fan, Ruchao, Chu, Wei, Chang, Peng, Xiao, Jing, Alwan, Abeer

arXiv.org Artificial Intelligence, Jun-17-2021

Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments.

In addition, Fujita et al. used the idea of the insertion transformer from NMT to generate the output sequence with an arbitrary order [12]. Another recent effective method is using multiple decoders as refiners to do an iterative refinement based on CTC alignments [14]. Theoretically, the iterative NAT has a limited improvement of inference speed since multiple iterations are still needed to obtain a competitive result. In contrast, single step NAT, which attempts to generate the output sequence with only one iteration, can have a better speed up for inference. The idea is to substitute the word embedding in autoregressive models with an acoustic representation for each output token, assuming that language semantics can also be captured …
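The trigger mask expansion mentioned in this abstract can be pictured with a small sketch. It is only one plausible reading: the mask is taken to be a token-by-frame boolean matrix marking which encoder frames each token may attend to, and expansion is taken to mean widening each token's span by a fixed number of frames on both sides. The function `trigger_mask` and its `expand` parameter are hypothetical names, and the spans are made-up values.

```python
from typing import List, Tuple


def trigger_mask(
    segments: List[Tuple[int, int]], num_frames: int, expand: int = 0
) -> List[List[bool]]:
    """Build a (num_tokens x num_frames) mask from per-token frame spans.

    segments holds (start, end) pairs with end exclusive. Each span is widened
    by `expand` frames on both sides (assumed scheme) and clipped to the
    valid frame range.
    """
    mask = []
    for start, end in segments:
        lo = max(0, start - expand)
        hi = min(num_frames, end + expand)
        mask.append([lo <= t < hi for t in range(num_frames)])
    return mask


# Three tokens over 11 frames; spans as a CTC alignment might give them.
segments = [(0, 3), (3, 8), (8, 11)]
strict = trigger_mask(segments, num_frames=11, expand=0)
widened = trigger_mask(segments, num_frames=11, expand=1)

for row in widened:
    print("".join("x" if allowed else "." for allowed in row))
```

With `expand=1`, neighbouring tokens share the frames around their boundary, so a boundary that the alignment places a frame too early or too late no longer hides the relevant frame from the token.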

decoder, speech recognition, transformer, (15 more...)

2106.09885

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > France (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
