
RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

Wang, Bailin, Lan, Chang, Wang, Chong, Pang, Ruoming

arXiv.org Artificial Intelligence

Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches. We open-source our Pallas kernels along with the model code to facilitate further research.
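As a rough illustration of the idea (not the paper's actual Pallas kernel), the sketch below combines exact softmax attention over a sliding window with a running linear-attention state that absorbs each token as it falls out of the window. The feature map `phi` and the 50/50 mixing of the two readouts are assumptions for illustration only:

```python
import numpy as np

def rattention_sketch(q, k, v, window=4):
    """Toy causal attention: softmax attention inside a sliding window,
    plus a linear-attention summary of tokens that fell out of the window.
    Conceptual sketch only; phi and the mixing rule are assumptions."""
    T, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map (assumption)
    out = np.zeros_like(v)
    S = np.zeros((d, d))      # running linear-attention state: sum of outer(phi(k_j), v_j)
    z = np.zeros(d)           # running normalizer: sum of phi(k_j)
    for t in range(T):
        lo = max(0, t - window + 1)
        # the token leaving the window is folded into the recurrent linear state
        if t - window >= 0:
            j = t - window
            S += np.outer(phi(k[j]), v[j])
            z += phi(k[j])
        # exact softmax attention over the in-window tokens
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        local = (w / w.sum()) @ v[lo:t + 1]
        # linear-attention readout for the out-of-window tokens
        denom = phi(q[t]) @ z
        if denom > 0:
            glob = (phi(q[t]) @ S) / denom
            out[t] = 0.5 * (local + glob)   # naive fixed mixing; a real model would learn this
        else:
            out[t] = local
    return out
```

Because the out-of-window state is a fixed-size matrix, memory stays O(window) regardless of sequence length, which is what allows the window to shrink without discarding distant context.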


FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Neural Information Processing Systems

Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory.
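The algorithmic core that makes IO-awareness possible is the online softmax: each K/V block is read once, and only running statistics (row max, normalizer, partial output) are carried between blocks, so the full attention matrix is never materialized. A minimal NumPy sketch of the K/V tiling only (real FlashAttention also tiles queries and manages SRAM explicitly):

```python
import numpy as np

def tiled_attention(q, k, v, block=4):
    """Exact softmax attention computed one K/V block at a time via the
    online-softmax recurrence; numerically identical to the naive version."""
    T, d = k.shape
    m = np.full(q.shape[0], -np.inf)          # running row max
    l = np.zeros(q.shape[0])                  # running softmax normalizer
    o = np.zeros((q.shape[0], v.shape[1]))    # running (unnormalized) output
    for s in range(0, T, block):
        kb, vb = k[s:s + block], v[s:s + block]
        scores = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)             # rescale old stats to the new max
        p = np.exp(scores - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        o = o * scale[:, None] + p @ vb
        m = m_new
    return o / l[:, None]
```

The rescaling by `exp(m - m_new)` is what lets partial results from earlier blocks be merged exactly, rather than approximately, with later ones.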



Evaluating Saliency Explanations in NLP by Crowdsourcing

Lu, Xiaotian, Li, Jiyi, Wan, Zhen, Lin, Xiaofeng, Takeuchi, Koh, Kashima, Hisashi

arXiv.org Artificial Intelligence

Deep learning models have performed well on many NLP tasks. However, their internal mechanisms are typically difficult for humans to understand. The development of methods to explain models has become a key issue for the reliability of deep learning models in many important applications. Various saliency explanation methods, which assign each input feature a score proportional to its contribution to the output, have been proposed to determine the parts of the input that a model values most. Despite a considerable body of work on the evaluation of saliency methods, whether the results of various evaluation metrics agree with human cognition remains an open question. In this study, we propose a new human-based method to evaluate saliency methods in NLP by crowdsourcing. We recruited 800 crowd workers and empirically evaluated seven saliency methods on two datasets with the proposed method. We analyzed the performance of saliency methods, compared our results with existing automated evaluation methods, and identified notable differences between the NLP and computer vision (CV) fields when using saliency methods. The instance-level data of our crowdsourced experiments and the code to reproduce the explanations are available at https://github.com/xtlu/lreccoling_evaluation.
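For concreteness, one of the simplest methods in this family is gradient-x-input: for a linear scorer f(x) = w · x the gradient with respect to the input is w, so each feature's saliency score is w_i * x_i, and the scores sum exactly to the model output (a completeness property). The function below is our illustration, not code from the paper:

```python
import numpy as np

def input_x_gradient_saliency(w, x):
    """Gradient-x-input saliency for a linear scorer f(x) = w @ x.
    The gradient is w, so the per-feature score is w_i * x_i;
    summing the scores recovers f(x) exactly."""
    return w * x
```

For nonlinear models the gradient is evaluated at x instead of being constant, and the completeness property holds only approximately, which is one reason automated and human evaluations of such scores can disagree.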


SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Wang, Zihao, Gan, Shaoduo

arXiv.org Artificial Intelligence

Optimizing the Key-Value (KV) cache of Large Language Models (LLMs) is considered critical to saving the cost of inference. Most existing KV-cache compression algorithms attempt to sparsify the sequence of tokens by exploiting the differing importance of tokens. In this work, we find that by identifying the importance of attention layers, we can optimize the KV-cache jointly along two dimensions. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention, which precisely optimizes the allocation of the KV-cache budget among layers on-the-fly and then incorporates three representative token sparsification algorithms to compress the KV-cache of each layer with its own budget. By optimizing the KV-cache along both the sequence and layer dimensions, SqueezeAttention achieves around 30% to 70% memory reduction and up to 2.2x throughput improvement across a wide range of LLMs and benchmarks. The code is available at https://github.com/hetailang/SqueezeAttention.
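A simplified stand-in for the layer-wise allocation step (SqueezeAttention's actual scoring and on-the-fly adjustment differ): split a total token budget across layers in proportion to per-layer importance scores, rounding so the per-layer budgets still sum to the total. Token sparsification within each layer would then run against its assigned budget:

```python
import numpy as np

def allocate_kv_budget(importance, total_budget):
    """Split a total KV-cache token budget across layers proportionally
    to per-layer importance scores (illustrative stand-in, not the
    paper's exact allocation rule)."""
    importance = np.asarray(importance, dtype=float)
    shares = importance / importance.sum()
    budgets = np.floor(shares * total_budget).astype(int)
    # hand leftover tokens to the layers with the largest fractional remainders
    leftover = total_budget - budgets.sum()
    order = np.argsort(-(shares * total_budget - budgets))
    budgets[order[:leftover]] += 1
    return budgets
```

The largest-remainder rounding guarantees the budgets are integers that sum exactly to the total, so no cache capacity is silently lost or overcommitted.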


The Devil in Linear Transformer

Qin, Zhen, Han, XiaoDong, Sun, Weixuan, Li, Dongxu, Kong, Lingpeng, Barnes, Nick, Zhong, Yiran

arXiv.org Artificial Intelligence

Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performance on various tasks and corpora. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structure. To address these issues, we first identify that the scaling of attention matrices is the devil behind the unbounded gradients, and that it turns out to be unnecessary in linear attention, as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, transNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing the vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer.
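The scaling-to-normalization swap can be caricatured as follows: drop the per-row normalizer of linear attention entirely and apply a LayerNorm-style normalization to the attention output instead. The feature map and the plain LayerNorm here are our simplifying assumptions; see the paper for the exact NormAttention formulation:

```python
import numpy as np

def norm_linear_attention(q, k, v, eps=1e-6):
    """Causal linear attention without the per-row normalizer: scores are
    left unnormalized and a LayerNorm-style normalization is applied to
    the output instead (simplified sketch of the NormAttention idea)."""
    phi = lambda x: np.maximum(x, 0.0)          # positive feature map (assumption)
    T = q.shape[0]
    scores = phi(q) @ phi(k).T
    scores *= np.tril(np.ones((T, T)))          # causal mask
    o = scores @ v                              # note: no denominator / scaling
    # per-token normalization replaces the softmax normalizer
    mu = o.mean(axis=1, keepdims=True)
    var = o.var(axis=1, keepdims=True)
    return (o - mu) / np.sqrt(var + eps)
```

Because the normalizer no longer appears inside the attention computation, its gradient pathologies disappear, while the output statistics stay bounded through the final normalization.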


Asking Friendly Strangers: Non-Semantic Attribute Transfer

Murrugarra-Llerena, Nils (University of Pittsburgh) | Kovashka, Adriana (University of Pittsburgh)

AAAI Conferences

Nickisch, and Harmeling 2009; Parikh and Grauman 2011; Akata et al. 2013), learn object models expediently by providing information about multiple object classes with each attribute label (Kovashka, Vijayanarasimhan, and Grauman 2011; Parkash and Parikh 2012), interactively recognize fine-grained object categories (Branson et al. 2010; Wah and Belongie 2013), and learn to retrieve images from precise human feedback (Kumar et al. 2011; Kovashka, Parikh, and Grauman 2015). Recent ConvNet approaches have shown how to learn accurate attribute models through multi-task learning (Fouhey, Gupta, and Zisserman 2016; Huang et al. 2015) or by localizing attributes (Xiao and Jae Lee 2015; Singh and Lee 2016). However, deep learning with ConvNets requires a large amount of data to be available for the task of interest, or for a related task (Oquab et

We propose an attention-guided transfer network. Briefly, our approach works as follows. First, the network receives training images for attributes in both the source and target domains. Second, it separately learns models for the attributes in each domain, and then measures how related each target domain classifier is to the classifiers in the source domains. Finally, it uses these measures of similarity (relatedness) to compute a weighted combination of the source classifiers, which then becomes the new classifier for the target attribute. We develop two methods, one where the target and source domains are disjoint, and another where there is some overlap between them. Importantly, we show that when the source attributes come from a diverse set of domains, the gain we obtain from this transfer of knowledge is greater than if we only use attributes from the same domain.
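The relatedness-weighted combination step can be sketched as follows, with cosine similarity standing in for the paper's learned relatedness measure; the function name and the non-negative weighting scheme below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def transfer_target_classifier(source_ws, target_w):
    """Build a target attribute classifier as a relatedness-weighted
    combination of source classifier weight vectors. Relatedness is
    approximated here by cosine similarity to an initial target
    classifier (illustrative simplification of the transfer scheme)."""
    source_ws = np.asarray(source_ws, dtype=float)
    t = np.asarray(target_w, dtype=float)
    cos = source_ws @ t / (np.linalg.norm(source_ws, axis=1) * np.linalg.norm(t) + 1e-12)
    rel = np.maximum(cos, 0.0)                   # drop anti-related sources
    rel = rel / (rel.sum() + 1e-12)              # normalize weights to sum to 1
    return rel @ source_ws                       # weighted combination of source weights
```

When the source pool spans diverse domains, more sources receive non-trivial weight, which mirrors the paper's observation that diverse sources yield a larger transfer gain than same-domain sources alone.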