AITopics | attention distillation

Collaborating Authors

attention distillation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Surprising Effectiveness of Attention Transfer for Vision Transformers Alexander C. Li

Neural Information Processing SystemsFeb-18-2026, 05:22:07 GMT

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.93)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

On the Surprising Effectiveness of Attention Transfer for Vision Transformers Alexander C. Li

Neural Information Processing SystemsOct-10-2025, 17:01:08 GMT

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.

attention distillation, attention map, attention transfer, (14 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.93)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation

Ding, Yidong, Niu, Jiafei, Yi, Ping

arXiv.org Artificial IntelligenceJan-5-2025

In recent years, attention-based models have excelled across various domains but remain vulnerable to backdoor attacks, often from downloading or fine-tuning on poisoned datasets. Many current methods to mitigate backdoors in NLP models rely on the pre-trained (unfine-tuned) weights, but these methods fail in scenarios where the pre-trained weights are not available. In this work, we propose MBTSAD, which can mitigate backdoors in the language model by utilizing only a small subset of clean data and does not require pre-trained weights. Specifically, MBTSAD retrains the backdoored model on a dataset generated by token splitting. Then MBTSAD leverages attention distillation, the retrained model is the teacher model, and the original backdoored model is the student model. Experimental results demonstrate that MBTSAD achieves comparable backdoor mitigation performance as the methods based on pre-trained weights while maintaining the performance on clean data. MBTSAD does not rely on pre-trained weights, enhancing its utility in scenarios where pre-trained weights are inaccessible. In addition, we simplify the min-max problem of adversarial training and visualize text representations to discover that the token splitting method in MBTSAD's first step generates Out-of-Distribution (OOD) data, leading the model to learn more generalized features and eliminate backdoor patterns.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.02754

Country: Asia (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Add feedback

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Li, Alexander C., Tian, Yuandong, Chen, Beidi, Pathak, Deepak, Chen, Xinlei

arXiv.org Artificial IntelligenceNov-14-2024

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning

attention distillation, attention map, attention transfer, (11 more...)

arXiv.org Artificial Intelligence

2411.09702

Genre: Research Report > New Finding (0.66)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Zhang, Michael, Bhatia, Kush, Kumbong, Hermann, Ré, Christopher

arXiv.org Artificial IntelligenceFeb-6-2024

Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as large language models into linear versions finetunable on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions up to 6 perplexity points on WikiText-103 with causal GPTs, and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. We finally turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves 28.1 higher ROUGE-1 points over the base standard attention model, where prior linear attentions lead to 16.5 point drops.

attention weight, feature map, linear attention, (13 more...)

arXiv.org Artificial Intelligence

2402.04347

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > China > Hong Kong (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Semiconductors & Electronics (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer

Zhao, Kaikai, Ukita, Norimichi

arXiv.org Artificial IntelligenceMar-16-2023

Scaled dot-product attention applies a softmax function on the scaled dot-product of queries and keys to calculate weights and then multiplies the weights and values. In this work, we study how to improve the learning of scaled dot-product attention to improve the accuracy of DETR. Our method is based on the following observations: using ground truth foreground-background mask (GT Fg-Bg Mask) as additional cues in the weights/values learning enables learning much better weights/values; with better weights/values, better values/weights can be learned. We propose a triple-attention module in which the first attention is a plain scaled dot-product attention, the second/third attention generates high-quality weights/values (with the assistance of GT Fg-Bg Mask) and shares the values/weights with the first attention to improve the quality of values/weights. The second and third attentions are removed during inference. We call our method knowledge-sharing DETR (KS-DETR), which is an extension of knowledge distillation (KD) in the way that the improved weights and values of the teachers (the second and third attentions) are directly shared, instead of mimicked, by the student (the first attention) to enable more efficient knowledge transfer from the teachers to the student. Experiments on various DETR-like methods show consistent improvements over the baseline methods on the MS COCO benchmark. Code is available at https://github.com/edocanonymous/KS-DETR.

knowledge management, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2302.11208

Country: Asia > Japan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Knowledge Management (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback