IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li

arXiv.org Artificial Intelligence 

Abstract--Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the Input-aware Multi-Level Spikeformer (IML-Spikeformer), a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers, while reducing theoretical inference energy consumption by 4.64× and 4.32×, respectively.

The high computational cost of such models has motivated the search for energy-efficient alternatives.

Zeyang Song and Haizhou Li are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 119077. Haizhou Li is also with the School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, 518172 China, and with the Shenzhen Loop Area Institute, Shenzhen, China.