ENA: Efficient N-dimensional Attention

Zhong, Yibo

arXiv.org Artificial Intelligence 

TL;DR: A layer-interleaved hybrid architecture of linear recurrence and attention (full or local) matches Transformer performance on high-order data with greater efficiency.

Efficient modeling of long sequences of high-order data requires a more efficient architecture than the Transformer. In this paper, we investigate two key aspects of extending linear recurrent models, especially those originally designed for language modeling, to high-order data (1D to ND): scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, while attention-hybrid models yield promising results. Focusing on the latter, we further evaluate types of attention and find that tiled high-order sliding window attention (SWA) is efficient in both theory and practice. We then conduct several experiments to demonstrate its effectiveness. The intuition behind ENA is that linear recurrence compresses global information into a state, while SWA complements it by enforcing strict local modeling. Together, they form a simple framework that offers a promising and practical solution for ultra-long high-order data modeling. Note that although we ultimately perform no sequence permutation, the framework remains compatible with any scanning strategy.

Softmax attention in LLMs has quadratic time complexity, making it inefficient for long sequences. Linear recurrent models address this limitation; representative variants include RetNet (Sun et al., 2023), HGRN (Qin et al., 2024), GLA (Yang et al., 2024a), GSA (Zhang et al., 2024), Mamba (Gu & Dao, 2023), and RWKV (Peng et al., 2023). Subsequent advancements such as DeltaNet (Yang et al., 2024b), Gated DeltaNet (Yang et al., 2025), DeltaProduct (Siems et al., 2025), LaCT (Zhang et al., 2025b), and MesaNet (von Oswald et al., 2025) further enhance expressiveness while preserving linear-time complexity and are optimized for parallel training.
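The layer-interleaved hybrid idea described above can be illustrated with a minimal NumPy sketch (a toy illustration under our own assumptions, not the authors' implementation; all function names here are hypothetical). Even layers run a gated linear recurrence that compresses the whole history into a fixed-size state in O(T) time; odd layers run causal sliding window attention that models a strict local neighborhood.

```python
import numpy as np

def linear_recurrence(q, k, v, decay=0.9):
    """Gated linear recurrence (toy): compress history into a (d x d) state.

    h_t = decay * h_{t-1} + k_t v_t^T ;  y_t = h_t^T q_t
    Runs in O(T) time with state size independent of sequence length.
    """
    T, d = q.shape
    h = np.zeros((d, d))
    out = np.zeros_like(v)
    for t in range(T):
        h = decay * h + np.outer(k[t], v[t])  # state update (global, compressed)
        out[t] = h.T @ q[t]                   # read-out against the state
    return out

def sliding_window_attention(q, k, v, window=4):
    """Causal softmax attention restricted to the last `window` tokens."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)           # strict local context
        scores = k[lo:t + 1] @ q[t] / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ v[lo:t + 1]
    return out

def hybrid_forward(x, n_layers=4, window=4, seed=0):
    """Layer-interleaved hybrid: alternate linear recurrence (global) with
    sliding window attention (local), with residual connections."""
    rng = np.random.default_rng(seed)
    T, d = x.shape
    for layer in range(n_layers):
        Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        mix = (linear_recurrence(q, k, v) if layer % 2 == 0
               else sliding_window_attention(q, k, v, window))
        x = x + mix
    return x
```

Both sub-layers are causal, so the composed stack remains causal; for 2D or 3D data, the local branch would instead attend over a tiled high-order window, which this 1D sketch does not attempt to reproduce.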
In this paper, "linear recurrent models", "linear recurrence", and "linear models" are used interchangeably to denote models that perform sequence modeling via state updates with linear time complexity. The same applies to "softmax attention" and "full attention". For convenience, we refer to all data with more than one dimension as high-order data, including images, which have only two dimensions.