AITopics | squeezeformer

3ccf6da39eeb8fefc8bbb1b0124adbd1-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 11:25:15 GMT

conformer-ctc, epoch, variant, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.48)

Automatic Speech Recognition

Neural Information Processing Systems, Feb-8-2026, 11:25:11 GMT

Furthermore, it has also achieved state-of-the-art performance in combination with recent developments in self-supervised learning methodologies [37,62].

artificial intelligence, arxiv preprint arxiv, machine learning, (18 more...)

Neural Information Processing Systems

Country: South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Neural Information Processing Systems, Dec-24-2025, 02:01:22 GMT

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.
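
As a quick check on the figures quoted in the abstract, the sketch below recovers the Conformer-CTC word error rates implied by the stated gaps. It assumes the 3.1%, 1.4%, and 0.6% improvements are absolute percentage points, which the abstract does not state explicitly; the variable names are illustrative only.

```python
# WER figures quoted in the abstract (LibriSpeech test-other, no external LM).
squeezeformer_wer = [7.5, 6.5, 6.0]

# Stated improvements over Conformer-CTC at the same number of FLOPs.
# Assumption: these are absolute percentage-point differences.
improvement = [3.1, 1.4, 0.6]

# Conformer-CTC WER implied by the two lists above.
implied_conformer_wer = [
    round(s + d, 1) for s, d in zip(squeezeformer_wer, improvement)
]

# The same gaps expressed as a relative reduction in WER (percent).
relative_reduction = [
    round(100 * d / c, 1) for d, c in zip(improvement, implied_conformer_wer)
]

print(implied_conformer_wer)
print(relative_reduction)
```

Under that assumption, the largest gap corresponds to the highest-WER (presumably smallest) model pair, and the gap narrows as WER drops.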

automatic speech recognition, efficient transformer, squeezeformer, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

A Appendix

Neural Information Processing Systems, Aug-14-2025, 07:49:59 GMT

We evaluate the final models on both clean and other datasets using CTC greedy decoding. Here, we use CUDA 11.5 and TensorFlow 2.5, and test with the largest [...] In both cases, we used the same SentencePiece tokenizer as in LibriSpeech training. All the other training recipes are the same as in A.1. As can be seen in the table, the general trend aligns with the LibriSpeech results in Tab.
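
CTC greedy decoding, mentioned in the snippet above, can be sketched as follows. This is a minimal pure-Python illustration, not the authors' TensorFlow implementation: the function name `ctc_greedy_decode`, the toy scores, and the choice of index 0 for the blank symbol are all assumptions made for the example.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, blank=0):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    # Arg-max token id for every time step.
    best = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # Collapse consecutive duplicates, then remove the blank symbol.
    return [token for token, _ in groupby(best) if token != blank]


# Toy per-frame scores over a 3-symbol vocabulary (index 0 = blank, assumed).
scores = [
    [0.10, 0.80, 0.10],  # token 1
    [0.20, 0.70, 0.10],  # token 1 again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.20],  # token 1 (kept, because a blank separates it)
    [0.10, 0.20, 0.70],  # token 2
]
decoded = ctc_greedy_decode(scores)
print(decoded)
```

The token ids returned here would then be mapped back to text by the tokenizer; no language model or beam search is involved.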

conformer-ctc, epoch, variant, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.48)

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Neural Information Processing Systems, Aug-14-2025, 07:49:52 GMT

Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal.

architecture, arxiv preprint arxiv, recognition, (15 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Neural Information Processing Systems, Oct-10-2024, 17:50:57 GMT

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal.
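
To see why a Temporal U-Net style reduction in sequence length lowers attention cost, consider the rough estimate below. It counts only the two sequence-length-squared matrix products in self-attention and ignores the linear projections. The down-sampling factor of 2, the sequence length, and the model dimension are assumptions chosen for illustration, not values taken from the abstract.

```python
def attention_score_macs(seq_len, dim):
    """Multiply-accumulates for QK^T plus the attention-weighted sum of V."""
    return 2 * seq_len * seq_len * dim


def downsample_length(seq_len, factor):
    """Sequence length after striding by `factor` (ceiling division)."""
    return -(-seq_len // factor)


# Hypothetical sizes: 1000 frames, model dimension 256, 2x down-sampling.
full = attention_score_macs(seq_len=1000, dim=256)
reduced = attention_score_macs(seq_len=downsample_length(1000, 2), dim=256)
ratio = full / reduced
print(ratio)
```

Because this term grows quadratically with sequence length, halving the number of frames cuts it by a factor of four, which is why the saving matters most on long sequences.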

automatic speech recognition, efficient transformer, squeezeformer, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.85)

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Jeon, Sungho, Yeh, Ching-Feng, Inan, Hakan, Hsu, Wei-Ning, Rungta, Rashi, Mehdad, Yashar, Bikel, Daniel

arXiv.org Artificial Intelligence, Feb-8-2024

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing convolutional modules with self-attention modules. They achieve state-of-the-art performance on ASR with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that we can achieve comparable efficiency with advanced self-attention solely. We demonstrate that this simpler approach is particularly beneficial with a low-bit weight quantization technique of a neural network to improve efficiency. We hypothesize that it prevents propagating the errors between different quantized modules compared to recent speech transformers mixing quantized convolution and the quantized self-attention modules.
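
The abstract does not say which low-bit weight quantization scheme is used. As a generic illustration only, the sketch below applies uniform symmetric quantization to a handful of made-up weights and measures the rounding error that each quantized module would introduce. The function names and the 4-bit setting are assumptions.

```python
def quantize_symmetric(weights, bits):
    """Uniform symmetric quantization: returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


# Made-up weights for illustration; 4-bit codes lie in [-7, 7].
weights = [0.42, -0.17, 0.05, -0.63]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight is off by at most half a quantization step, and fewer bits mean a larger step, so per-module error grows as the bit width shrinks.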

efficiency, module, transformer, (16 more...)

arXiv.org Artificial Intelligence

2311.02772

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)
Information Technology > Artificial Intelligence > Natural Language (0.85)

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Kim, Sehoon, Gholami, Amir, Shaw, Albert, Lee, Nicholas, Mangalam, Karttikeya, Malik, Jitendra, Mahoney, Michael W., Keutzer, Kurt

arXiv.org Artificial Intelligence, Oct-15-2022

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise downsampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online [25].

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2206.00888

Country: