 attention value







ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference

arXiv.org Artificial Intelligence

Although vision transformers (ViTs) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information in different blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, we introduce a novel Token Freezing and Reusing (ToFe) framework, which identifies important tokens at each stage and temporarily freezes the unimportant ones, allowing their lagged reuse at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing these modules with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of the LV-ViT model by 50% with less than a 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity than state-of-the-art methods. Large-scale pre-trained vision transformer (ViT) models [37] have achieved remarkable progress in the field of vision tasks.
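The freeze-and-reuse idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the importance scores are hard-coded here (the paper obtains them from a learned prediction module), frozen tokens are reinserted unchanged (the paper recovers them with an approximate module), and the names `select_and_freeze`, `toy_block`, and `reuse` are hypothetical.

```python
def select_and_freeze(tokens, scores, keep_ratio):
    """Split tokens into active ones (highest scores) and frozen ones.

    Each token is returned together with its original position so that
    frozen tokens can be put back in place later.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n_keep])
    frozen_idx = sorted(order[n_keep:])
    active = [(i, tokens[i]) for i in keep_idx]
    frozen = [(i, tokens[i]) for i in frozen_idx]
    return active, frozen


def toy_block(active):
    """Stand-in for a transformer block: only active tokens are updated."""
    return [(i, [x + 1.0 for x in tok]) for i, tok in active]


def reuse(active, frozen):
    """Merge frozen tokens back at their original positions."""
    merged = sorted(active + frozen, key=lambda pair: pair[0])
    return [tok for _, tok in merged]


# Four one-dimensional tokens with assumed (hypothetical) importance scores.
tokens = [[0.0], [1.0], [2.0], [3.0]]
scores = [0.9, 0.1, 0.8, 0.2]

active, frozen = select_and_freeze(tokens, scores, keep_ratio=0.5)
active = toy_block(active)          # the block processes only 2 of 4 tokens
restored = reuse(active, frozen)    # frozen tokens rejoin the sequence later
```

In this toy run the block touches only the two highest-scoring tokens, and the other two skip the block and rejoin the sequence afterwards, which is where the compute saving would come from.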


DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

arXiv.org Artificial Intelligence

Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
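To make the contrast with static masks concrete, the sketch below derives a sparse causal mask from the attention scores themselves, so each row keeps a different pattern. It is only an illustration under stated assumptions, not the DAM algorithm: the relative threshold rule and the names `dynamic_mask`, `masked_attention`, and `rel_threshold` are hypothetical, and the mask is computed here from the full score matrix purely for clarity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_mask(scores, rel_threshold=0.1):
    """Build a causal mask that adapts to each row of attention scores.

    Position (i, j) is kept when j <= i and its attention weight is at
    least rel_threshold times the largest weight in row i.
    """
    mask = []
    for i, row in enumerate(scores):
        weights = softmax(row[: i + 1])  # causal: query i sees keys 0..i
        top = max(weights)
        kept = [w >= rel_threshold * top for w in weights]
        mask.append(kept + [False] * (len(row) - i - 1))
    return mask


def masked_attention(scores, mask):
    """Renormalize attention over the kept positions only."""
    out = []
    for row, keep in zip(scores, mask):
        kept_scores = [s for s, k in zip(row, keep) if k]
        weights = iter(softmax(kept_scores))
        out.append([next(weights) if k else 0.0 for k in keep])
    return out


# Toy 3x3 score matrix (query rows, key columns); values are made up.
scores = [
    [0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0],
    [0.0, 4.0, 4.0],
]
mask = dynamic_mask(scores)
attn = masked_attention(scores, mask)
```

Row 1 keeps only its first key while row 2 keeps its last two, so the sparsity pattern differs per row instead of following one predefined shape.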


Convolutional Rectangular Attention Module

arXiv.org Machine Learning

In this paper, we introduce a novel spatial attention module that can be integrated into any convolutional network. This module guides the model to pay attention to the most discriminative part of an image, which enables the model to attain better performance through end-to-end training. In standard approaches, a spatial attention map is generated in a position-wise fashion. We observe that this results in very irregular boundaries, which could make it difficult to generalize to new samples. In our method, the attention region is constrained to be rectangular. This rectangle is parametrized by only 5 parameters, allowing for better stability and generalization to new samples. In our experiments, our method systematically outperforms the position-wise counterpart. Thus, it provides a novel and useful spatial attention mechanism for convolutional models. Our module also offers interpretability concerning the "where to look" question, as it helps to identify the part of the input on which the model focuses to produce its prediction.
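A rectangular attention map with five parameters might look like the following sketch. The abstract does not say which five parameters are used; this example assumes a center (`cx`, `cy`), half-extents (`half_w`, `half_h`), and a rotation angle `theta`, and uses a product of sigmoids to get soft edges. The function name `rect_attention` and the `sharpness` constant are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rect_attention(height, width, cx, cy, half_w, half_h, theta, sharpness=10.0):
    """Return a height x width soft attention map shaped like a rotated rectangle.

    Assumed parametrization (not taken from the paper): center (cx, cy),
    half-width, half-height, and rotation angle theta in radians.
    """
    amap = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate pixel coordinates into the rectangle's own frame.
            u = dx * math.cos(theta) + dy * math.sin(theta)
            v = -dx * math.sin(theta) + dy * math.cos(theta)
            # Each factor is close to 1 inside the rectangle along that axis
            # and close to 0 outside it.
            inside_u = sigmoid(sharpness * (half_w - abs(u)))
            inside_v = sigmoid(sharpness * (half_h - abs(v)))
            row.append(inside_u * inside_v)
        amap.append(row)
    return amap


# A wide, flat rectangle centered in a 5x5 feature map.
amap = rect_attention(5, 5, cx=2, cy=2, half_w=1.5, half_h=0.5, theta=0.0)
```

Because the map is a smooth function of the five numbers, it could in principle be learned by gradient descent, and its boundary is regular by construction, unlike a position-wise map.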


Cross-Encoder Rediscovers a Semantic Variant of BM25

arXiv.org Artificial Intelligence

Neural Ranking Models (NRMs) have rapidly advanced state-of-the-art performance on information retrieval tasks. In this work, we investigate a Cross-Encoder variant of MiniLM to determine which relevance features it computes and where they are stored. We find that it employs a semantic variant of the traditional BM25 in an interpretable manner, featuring localized components: (1) Transformer attention heads that compute soft term frequency while controlling for term saturation and document length effects, and (2) a low-rank component of its embedding matrix that encodes inverse document frequency information for the vocabulary. This suggests that the Cross-Encoder uses the same fundamental mechanisms as BM25, but further leverages their capacity to capture semantics for improved retrieval performance. This granular understanding lays the groundwork for model editing to enhance model transparency, address safety concerns, and improve scalability in training and real-world applications.
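For reference, the classical lexical BM25 that the abstract compares against can be sketched as below, showing the three ingredients it names: term frequency with saturation (`k1`), document length normalization (`b`), and inverse document frequency. This is the standard formula with a non-negative IDF variant, not the Cross-Encoder's learned semantic version; the function name `bm25_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger weight (inverse document frequency).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Saturating term frequency, normalized by document length.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "cat", "cat", "hat"],
]
scores = bm25_scores(["cat"], docs)
```

The third document mentions "cat" three times and scores higher than the first, but by less than a factor of three, which is the term saturation effect the abstract refers to.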


DepressionX: Knowledge Infused Residual Attention for Explainable Depression Severity Assessment

arXiv.org Artificial Intelligence

In today's interconnected society, social media platforms have become an important part of our lives, where individuals virtually express their thoughts, emotions, and moods. These expressions offer valuable insights into their mental health. This paper explores the use of platforms like Facebook, $\mathbb{X}$ (formerly Twitter), and Reddit for mental health assessments. We propose a domain knowledge-infused residual attention model called DepressionX for explainable depression severity detection. Existing deep learning models for this problem have shown considerable performance, but they often lack transparency in their decision-making processes. In healthcare, where decisions are critical, explainability is essential. Our model addresses this gap by focusing on the explainability of depression severity detection while aiming for high accuracy. In addition to being explainable, our model consistently outperforms state-of-the-art models by over 7% in terms of $\text{F}_1$ score on balanced as well as imbalanced datasets. Our ultimate goal is to establish a foundation for trustworthy and comprehensible analysis of mental disorders via social media.