local attention
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation can suffer from memory degradation in long contexts and limit fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed for 100K sequence length and 9$\times$ in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results.
On Inductive Biases That Enable Generalization of Diffusion Transformers
Recent work studying the generalization of diffusion models with locally linear UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. For such locally linear UNets, these geometry-adaptive harmonic bases can be conveniently visualized through the eigen-decomposition of a UNet's Jacobian matrix. In practice, however, more recent denoising networks are often transformer-based, e.g., the diffusion transformer (DiT). Due to the presence of nonlinear operations, similar eigen-decomposition analyses cannot be used to reveal the inductive biases of transformer-based denoisers. This motivates our search for alternative ways to explain the strong generalization ability observed in DiT models.
Associating Objects with Transformers for Video Object Segmentation
This paper investigates how to realize better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and thus have to match and segment each target separately under multi-object scenarios, consuming several times the computing resources. To solve this problem, we propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can simultaneously process multiple objects' matching and segmentation decoding as efficiently as processing a single object.
cf78a15772ec1a6aee9bbee2d2b382c3-Supplemental-Conference.pdf
Our first step is to prove the parameterization (Eq. 3) provides local attention after the ... Note that the weight and bias terms in the above formulation (Eq. ... Assume the position-based function at each head is learned to perform 'hard attention' on one of its surrounding positions, i.e., an extreme semi-dynamic attention. To demonstrate this phenomenon, we plot and compare the impacts of Φc and Φp on Φa in the middle and right of Fig. S4 and visualize learned position-based attention Φp of iRPE in Fig. S5. As seen from Tab. S17, there exist noticeable performance gaps between the models (b, f, g, h) (without Φp) and (a, d, e, i) (with Φp). Without adaptive attention (model (c)), Φp imposes stronger locality on every layer.
RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling
Wei, Xiuying, Yadav, Anunay, Pascanu, Razvan, Gulcehre, Caglar
Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation can suffer from memory degradation in long contexts and limit fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed for 100K sequence length and 9$\times$ in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
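The chunking idea described in the abstract above (recurrence inside each chunk, softmax attention across chunks) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation; the linked repository is the reference for that. The fixed scalar `gate`, the reuse of the raw input as the query, the absence of learned projections, and the names `chunk_recurrence` and `rat_like_layer` are all assumptions made purely for illustration.

```python
import math


def chunk_recurrence(xs, chunk_size, gate=0.5):
    """Toy gated recurrence that restarts at every chunk boundary.

    Assumption: a fixed scalar gate stands in for whatever learned
    gating a real model would use.
    """
    states = []
    h = None
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            h = [0.0] * len(x)  # reset the state at the start of each chunk
        h = [gate * hi + (1.0 - gate) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def rat_like_layer(xs, chunk_size, gate=0.5):
    """Hypothetical chunked layer: recurrence within chunks, attention across.

    Token t attends over the final state of every completed earlier chunk
    plus the running state of its own chunk, so it never sees the future.
    """
    states = chunk_recurrence(xs, chunk_size, gate)
    d = len(xs[0])
    outputs = []
    for t, q in enumerate(xs):
        c = t // chunk_size  # index of the chunk containing token t
        # One summary per completed chunk: the state at its last position.
        memory = [states[(j + 1) * chunk_size - 1] for j in range(c)]
        # Plus the running (causal) state of the current chunk.
        memory.append(states[t])
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory
        ]
        weights = softmax(scores)
        out = [sum(w * k[i] for w, k in zip(weights, memory)) for i in range(d)]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
    for row in rat_like_layer(xs, chunk_size=2):
        print(row)
```

In this sketch the token at position t attends over t // chunk_size + 1 entries rather than t + 1, which is the source of the saving relative to full attention. Setting `chunk_size` to at least the sequence length leaves only the recurrence, while `chunk_size=1` with `gate=0.0` reduces to ordinary causal attention over the inputs.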