Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions

Pei, Yan Ru

arXiv.org Artificial Intelligence 

We introduce Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training. The optimal order of tensor contractions can then be systematically determined for every SSM block to maximize training efficiency. This allows more flexibility in designing SSM blocks beyond the commonly implemented depthwise-separable configuration. The new design choices take inspiration from classical convolutional blocks, including group convolutions, full convolutions, and bottleneck blocks. We architect the Centaurus network as a mixture of these blocks to balance network size and performance, as well as memory and computational efficiency, during both training and inference. We show that this heterogeneous network design outperforms its homogeneous counterparts on raw audio processing tasks including keyword spotting, speech denoising, and automatic speech recognition (ASR). For ASR, Centaurus is the first network with competitive performance that can be made fully state-space based, without using any nonlinear recurrence (LSTMs), explicit convolutions (CNNs), or (surrogate) attention mechanisms.

Sequence or temporal modeling encompasses a wide range of tasks, from audio processing to language modeling. Traditionally, a variety of (related) statistical methods have been employed (Box et al., 2015). In the age of deep learning, neural networks have predominantly been used (LeCun et al., 2015), including recurrent neural networks (RNNs), convolutional neural networks (CNNs), transformers (Vaswani, 2017), and neural ODEs (Chen et al., 2018). In many cases, the model inevitably suffers from one of two drawbacks: 1) it cannot be efficiently trained (or fitted) in parallel due to its sequential nature, or 2) it cannot be efficiently configured for online inference due to its large memory and computational requirements. To address this, deep state-space models (SSMs) were adapted for sequence modeling and have shown remarkable potential across a wide range of tasks (Gu et al., 2021; Goel et al., 2022; Gu & Dao, 2023). Owing to the linearity of the SSM layers, they can not only be configured for efficient online inference with small memory and computational resources, but also trained efficiently on parallel hardware using unrolling strategies (Gu et al., 2022; Smith et al., 2022; Dao & Gu, 2024; Heinsen, 2023).

Currently, most deep SSM networks (along with most neural networks in general) follow the architectural recipe of transformers: they are composed of uniform "SSM blocks" throughout the network, with little to no variation in the shapes of the intermediate features or weights. This simplifies the design of deep SSM networks, but may sacrifice performance and efficiency in practice.
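To make the "SSM as tensor contraction" view concrete, the sketch below writes a depthwise diagonal SSM (S4D-style) as a single einsum over the parameters, a causal state kernel, and the input, and lets NumPy's einsum_path search for the cheapest contraction order; it then checks the result against the step-by-step recurrence used for online inference. All shapes, the simple discretization, and the parameter layout here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the paper):
# batch, sequence length, channels, state dimension per channel.
B, L, H, N = 4, 128, 8, 16
dt = 1e-2  # assumed step size for the simple discretization below

rng = np.random.default_rng(0)
A = -rng.random((H, N))             # stable (negative) decay rates of a diagonal state matrix
Bm = rng.standard_normal((H, N))    # input-to-state weights
C = rng.standard_normal((H, N))     # state-to-output weights
u = rng.standard_normal((B, L, H))  # input sequence

# Causal state kernel Z[h, n, t, s] = exp(A[h, n] * (t - s) * dt) for t >= s, else 0.
lag = np.arange(L)[:, None] - np.arange(L)[None, :]        # (L, L) matrix of lags t - s
Z = np.exp(A[:, :, None, None] * dt * lag) * (lag >= 0)    # (H, N, L, L)

# Training view: the whole depthwise SSM is one tensor contraction,
#   y[b, t, h] = sum_{n, s} C[h, n] * Bm[h, n] * Z[h, n, t, s] * u[b, s, h].
# Different contraction orders (e.g. forming a Toeplitz kernel first vs. pushing
# the inputs into the states first) have different costs; einsum_path searches
# for the cheapest one, which is the kind of decision the paper makes
# systematically per SSM block.
expr = "hn,hn,hnts,bsh->bth"
path, report = np.einsum_path(expr, C, Bm, Z, u, optimize="optimal")
y = np.einsum(expr, C, Bm, Z, u, optimize=path)
print(report)  # chosen contraction order and estimated FLOP savings

# Inference view: the same SSM runs as a cheap per-step linear recurrence,
#   x_t = exp(A * dt) * x_{t-1} + Bm * u_t,   y_t[h] = sum_n C[h, n] * x_t[h, n],
# needing only O(H * N) state per step.
decay = np.exp(A * dt)                      # (H, N) per-step decay
x = np.zeros((B, H, N))
y_rec = np.zeros_like(y)
for t in range(L):
    x = decay * x + Bm * u[:, t, :, None]   # (B, H, N)
    y_rec[:, t, :] = (C * x).sum(-1)

assert np.allclose(y, y_rec)                # both views compute the same output
```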
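The ordering question becomes more consequential once a block is no longer depthwise-separable. The sketch below uses a hypothetical "full" configuration in which the B and C projections also mix channels; the sizes, names (Bfull, Cfull), and parameterization are assumptions made for illustration rather than the paper's exact block design. Querying einsum_path on the resulting four-operand contraction shows how the cheapest order can avoid materializing the full convolution kernel.

```python
import numpy as np

# Assumed sizes for illustration: batch, length, input/output channels, heads, state dim.
Bt, L, E, D, H, N = 2, 128, 32, 32, 4, 8
dt = 1e-2

rng = np.random.default_rng(1)
A = -rng.random((H, N))                                   # diagonal state decay rates
lag = np.arange(L)[:, None] - np.arange(L)[None, :]
Z = np.exp(A[:, :, None, None] * dt * lag) * (lag >= 0)   # (H, N, L, L) causal state kernel

# Hypothetical channel-mixing projections (a "full" SSM block for illustration):
Bfull = rng.standard_normal((H, N, E))   # input channels -> states
Cfull = rng.standard_normal((D, H, N))   # states -> output channels
u = rng.standard_normal((Bt, L, E))      # input sequence

# y[b, t, d] = sum_{h, n, s, e} Cfull[d, h, n] * Z[h, n, t, s] * Bfull[h, n, e] * u[b, s, e]
# A naive order that fuses Cfull, Z, and Bfull first would materialize a
# D x L x L x E convolution kernel; the cheapest order reported below typically
# contracts Bfull with the inputs first and never forms that kernel.
expr = "dhn,hnts,hne,bse->btd"
path, report = np.einsum_path(expr, Cfull, Z, Bfull, u, optimize="optimal")
print(report)                            # inspect the chosen contraction order and cost
y = np.einsum(expr, Cfull, Z, Bfull, u, optimize=path)
```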