AITopics | self-attention module

Pay Attention to MLPs

Neural Information Processing Systems, Apr-25-2026, 19:27:46 GMT

Transformers [1] have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
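The abstract above does not spell out how the gating works. As a rough illustration only, the sketch below shows one way an MLP block can gate its channels with a projection taken across the token axis instead of self-attention. The function names, the channel split, and the layer normalization step are assumptions made for this example, not details confirmed by the text.

```python
import math

def layer_norm(row, eps=1e-6):
    # Normalize one token's channel vector to zero mean and unit variance.
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def spatial_gating(z, w, b):
    """Gate half of the channels with a token-mixing projection of the other half.

    z: n tokens x d channels (d even), w: n x n token-mixing weights, b: n biases.
    All three are hypothetical inputs chosen for this sketch.
    """
    n, half = len(z), len(z[0]) // 2
    u = [row[:half] for row in z]              # channels that get gated
    v = [layer_norm(row[half:]) for row in z]  # channels that form the gate
    # Mix information across tokens: gate[i][c] = sum_j w[i][j] * v[j][c] + b[i]
    gate = [[sum(w[i][j] * v[j][c] for j in range(n)) + b[i]
             for c in range(half)] for i in range(n)]
    # Element-wise product: the only place where tokens interact.
    return [[u[i][c] * gate[i][c] for c in range(half)] for i in range(n)]
```

With zero mixing weights and unit biases the gate is all ones, so the block passes the first half of the channels through unchanged, which is a convenient starting point when experimenting with such a layer.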

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Supplementary Material: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Neural Information Processing Systems, Feb-12-2026, 03:19:13 GMT

We perform finetuning with image-text contrastive and image-text matching losses. During inference, VLMO is first used as a dual encoder to obtain top-k candidates, then the model is used as a fusion encoder to rerank the candidates. For the text-only pre-training data, we use English Wikipedia and BookCorpus [5]. Table 1: Ablation study of the shared self-attention module used in Multiway Transformer.
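The two-stage inference described above (cheap dual-encoder retrieval of top-k candidates, then reranking with a fusion encoder) can be sketched as follows. This is a minimal illustration, not the VLMO implementation: `retrieve_then_rerank`, the dot-product retrieval score, and the caller-supplied `fusion_score` function are all hypothetical stand-ins for the real encoders.

```python
def dot(a, b):
    # Similarity used by the (hypothetical) dual encoder: a plain dot product.
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, items, fusion_score, k):
    """Return the top-k item names, reordered by the more expensive scorer.

    items: dict mapping a candidate name to its precomputed embedding.
    fusion_score: callable(query_vec, item_vec) -> float, standing in for a
    fusion encoder that looks at the query and the candidate jointly.
    """
    # Stage 1: score every candidate with the cheap dual-encoder similarity.
    shortlist = sorted(items, key=lambda name: dot(query_vec, items[name]),
                       reverse=True)[:k]
    # Stage 2: rerank only the shortlist with the expensive joint scorer.
    return sorted(shortlist, key=lambda name: fusion_score(query_vec, items[name]),
                  reverse=True)
```

The design point is cost: the first stage can use embeddings computed once per candidate, while the second stage runs the joint model only k times per query.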

artificial intelligence, machine learning, supplementarymaterial, (10 more...)

Neural Information Processing Systems

Country: North America > United States > Maryland > Baltimore (0.06)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Activating Self-Attention for Multi-Scene Absolute Pose Regression

Neural Information Processing Systems, Feb-12-2026, 01:21:57 GMT

Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England > Tyne and Wear > Newcastle (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Neural Information Processing Systems, Feb-8-2026, 04:37:04 GMT

distillation, student, student model, (14 more...)

Neural Information Processing Systems

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(6 more...)

Industry: Education (0.73)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Neural Information Processing Systems, Dec-23-2025, 23:32:24 GMT

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works.
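The two kinds of self-attention knowledge named above, attention distributions (scaled dot-product of queries and keys) and the scaled dot-product between values, can be illustrated with a small sketch. The abstract does not say how teacher and student are compared, so the KL divergence used here is an assumption, as are all function names; the example also covers a single attention head only.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_relation(a, b):
    # Row-wise softmax of (a @ b^T) / sqrt(d): an n x n matrix of distributions.
    d = len(a[0])
    return [softmax([sum(x * y for x, y in zip(ra, rb)) / math.sqrt(d) for rb in b])
            for ra in a]

def mean_kl(p, q):
    # Average over rows of KL(p_row || q_row); assumed mismatch measure.
    total = 0.0
    for pr, qr in zip(p, q):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(pr, qr) if pi > 0)
    return total / len(p)

def self_attention_distill_loss(tq, tk, tv, sq, sk, sv):
    """Teacher (t*) and student (s*) queries, keys, values for one head."""
    # Attention distributions: softmax of scaled query-key dot products.
    attn_loss = mean_kl(scaled_dot_relation(tq, tk), scaled_dot_relation(sq, sk))
    # Value relations: softmax of scaled value-value dot products.
    value_loss = mean_kl(scaled_dot_relation(tv, tv), scaled_dot_relation(sv, sv))
    return attn_loss + value_loss
```

Because both relation matrices are n x n over the same token sequence, the teacher and student may use different hidden sizes in this sketch, which fits the flexibility the abstract claims for the student.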

deep self-attention distillation, name change, task-agnostic compression, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.59)
Information Technology > Artificial Intelligence > Machine Learning (0.39)

Activating Self-Attention for Multi-Scene Absolute Pose Regression

Neural Information Processing Systems, Oct-10-2025, 00:45:31 GMT

Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments.

dataset, experiment, query region, (14 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England > Tyne and Wear > Newcastle (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Neural Information Processing Systems, Oct-2-2025, 18:08:19 GMT

The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher).

distillation, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States (0.93)
Europe (0.68)

Industry: Education (0.73)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Error Correction Code Transformer

Neural Information Processing Systems, Aug-19-2025, 21:58:27 GMT

We encode each channel's output dimension to a high dimension for better representation of the bits' information to be processed separately.

artificial intelligence, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Industry: Energy > Oil & Gas (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware

Lin, Ching-Yi, Shah, Sahil

arXiv.org Artificial Intelligence, Aug-6-2025

Pre-trained vision transformers have achieved remarkable performance across various visual tasks but suffer from expensive computational and memory costs. While model quantization reduces memory usage by lowering precision, these models still incur significant computational overhead due to the dequantization before matrix operations. In this work, we analyze the computation graph and propose an integerization process based on operation reordering. Specifically, the process delays dequantization until after matrix operations. This enables integerized matrix multiplication and linear module by directly processing the quantized input. To validate our approach, we synthesize the self-attention module of ViT on a systolic array-based hardware. Experimental results show that our low-bit inference reduces per-PE power consumption for linear layer and matrix multiplication, bridging the gap between quantized models and efficient inference.
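The reordering idea above, delaying dequantization until after the matrix operation, can be checked on a toy case. The sketch assumes symmetric per-tensor quantization with a single scale per matrix, which the abstract does not specify; under that assumption the scales factor out of the product, so the multiplication itself can run on integers. All function names are hypothetical.

```python
def quantize(matrix, scale):
    # Symmetric per-tensor quantization (assumed scheme): real value ~ scale * integer.
    return [[round(x / scale) for x in row] for row in matrix]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dequantize_then_matmul(xq, wq, sx, sw):
    # Baseline order: convert both operands to floats, then multiply in floats.
    x = [[v * sx for v in row] for row in xq]
    w = [[v * sw for v in row] for row in wq]
    return matmul(x, w)

def matmul_then_dequantize(xq, wq, sx, sw):
    # Reordered: integer-only matmul first, one rescale by sx * sw afterwards.
    acc = matmul(xq, wq)
    return [[v * (sx * sw) for v in row] for row in acc]
```

Both orderings compute the same product because (sx * Xq)(sw * Wq) = (sx * sw)(Xq Wq); the reordered form keeps the inner loop in integer arithmetic, which is the property a hardware matrix unit can exploit.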

artificial intelligence, machine learning, matrix multiplication, (15 more...)

arXiv.org Artificial Intelligence

2504.18547

Country: North America > United States > Maryland (0.15)

Genre: Research Report > New Finding (0.67)

Technology: