Extracting Compact Recurrences From Convolutions
Neural Information Processing Systems
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads: like attention-based models, they naively require a full pass over the input sequence (or caching of activations) for each generated token. In this paper, we seek to enable O(1) compute and memory cost per token in any pre-trained long convolution architecture, reducing the memory footprint and increasing throughput during generation. Concretely, our method consists of extracting low-dimensional linear state-space models from each convolution layer, building upon rational interpolation and model-order reduction techniques. We further introduce architectural improvements to convolution-based layers such as Hyena: by weight-tying the filters across channels into heads, we achieve higher pre-training quality and reduce the number of filters to be distilled. The resulting model achieves 10× higher throughput than Transformers and 1.5× higher than Hyena at 1.3B parameters.
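The abstract's central idea, replacing each long convolution filter with a small linear state-space model (SSM) so that generation only carries a constant-size state, can be illustrated with a toy sketch. The block below is a minimal illustration, not the paper's implementation: the SSM parameters (A, B, C) and the state dimension are made-up placeholders, and the actual fitting step via rational interpolation and model-order reduction is omitted. It only shows why a convolution whose filter equals the SSM's impulse response can be evaluated recurrently with O(1) compute and memory per token.

```python
# Minimal sketch (illustrative names, scalar-input/scalar-output channel):
# a long convolution filter h of length L is replaced by a small SSM whose
# impulse response h[k] = C A^k B matches the filter, so each generated
# token costs O(1) compute/memory instead of a full pass over the prefix.

import numpy as np

def ssm_impulse_response(A, B, C, length):
    """Impulse response h[k] = C @ A^k @ B of a discrete linear SSM."""
    h = np.empty(length)
    x = B.copy()                      # state right after a unit impulse
    for k in range(length):
        h[k] = float(C @ x)
        x = A @ x                     # advance the state one step
    return h

def generate_recurrently(A, B, C, inputs):
    """Auto-regressive filtering with O(1) cost per token:
    only the fixed-size SSM state is carried between steps."""
    state = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:                  # one scalar input per step
        state = A @ state + B * u     # constant-cost state update
        outputs.append(float(C @ state))
    return np.array(outputs)

# Sanity check: the recurrent output equals convolution with the SSM's
# impulse response (here a random stable diagonal SSM stands in for a
# distilled one).
rng = np.random.default_rng(0)
state_dim, seq_len = 8, 64
A = np.diag(rng.uniform(-0.9, 0.9, state_dim))   # stable diagonal A
B = rng.normal(size=state_dim)
C = rng.normal(size=state_dim)
u = rng.normal(size=seq_len)

h = ssm_impulse_response(A, B, C, seq_len)
y_conv = np.convolve(u, h)[:seq_len]             # full-convolution view
y_rec = generate_recurrently(A, B, C, u)         # recurrent O(1)/token view
assert np.allclose(y_conv, y_rec)
```

In the paper, the (A, B, C) of each layer would instead be fitted to the pre-trained Hyena filters; the toy diagonal A above only stands in for such a distilled parameterization.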