kernelized attention
Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Figure 1: Validation loss changes over 50k steps.

Consider a finite sequence {X_k} of independent, random, self-adjoint matrices of dimension n. For a certain n-by-n orthogonal matrix H (HH^T is a diagonal matrix) and an n-by-d uniform sub-sampling matrix S (as defined in Definition 1 in the main paper), we denote the sketching matrix Π := √n S. We aim to show that HΠΠ^T H^T satisfies the (1/2, δ)-MA property for HH^T by the following lemma. The first inequality of the preceding display holds because H is an orthogonal matrix. It is easy to check that CC^T = B(I − P_Π)B^T.
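The uniform sub-sampling sketch above can be illustrated with a minimal numpy sketch. The construction below (each column of S selects one coordinate, with the scaling chosen so that the sketch is an identity in expectation) is a standard uniform-sampling sketch and an assumption for illustration; the paper's Definition 1 may differ in details:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 8, 4  # original dimension and sketch size

# Uniform sub-sampling matrix S: each column picks one of the n
# coordinates (standard construction; Definition 1 in the paper
# may differ, e.g. in sampling with replacement).
cols = rng.choice(n, size=d, replace=False)
S = np.zeros((n, d))
S[cols, np.arange(d)] = 1.0

# Sketching matrix Pi, scaled so the selected diagonal entries of
# Pi Pi^T equal n/d and E[Pi Pi^T] = I (illustrative assumption).
Pi = np.sqrt(n / d) * S

# For a single draw, Pi Pi^T is diagonal with entries n/d or 0,
# so its trace is always d * (n/d) = n.
print(np.diag(Pi @ Pi.T))
```

Sampling without replacement keeps each coordinate's inclusion probability at d/n, which is what makes the n/d rescaling give the correct expectation.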
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding
The attention module, a crucial component of the Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention.
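The kernelized attention this abstract builds on replaces the softmax with a non-negative feature map φ, so the n×n attention matrix never has to be formed: associativity lets (φ(Q)φ(K)^T)V be computed as φ(Q)(φ(K)^T V) in time linear in sequence length. A minimal numpy sketch, with an illustrative ReLU feature map rather than the paper's exact construction:

```python
import numpy as np

def kernelized_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # phi: a non-negative feature map (shifted ReLU here, a common
    # simple choice; the paper's specific kernel may differ).
    Qp, Kp = phi(Q), phi(K)          # (n, r) each
    KV = Kp.T @ V                    # (r, d): computed once, O(n r d)
    Z = Qp @ Kp.sum(axis=0)          # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]    # O(n r d) overall -- linear in n

rng = np.random.default_rng(0)
n, dim = 1024, 64
Q, K, V = rng.standard_normal((3, n, dim))
out = kernelized_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

The key point is that the quadratic object φ(Q)φ(K)^T is never materialized; the normalizer Z equals its row sums, so the result matches the explicit row-normalized attention exactly.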
Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Figure: Entropy loss on the validation set. We further remark that on Text Classification all models quickly fall into over-fitting, and thus the validation losses rise quickly. Results are averaged over one random batch from the test set in each LRA task. Such matrices are considered more informative since they are harder to approximate, requiring higher rank even in a truncated SVD approximation. This section introduces some useful facts, which are key to the proofs in the next section.
SchoenbAt: Rethinking Attention with Polynomial basis
Yuhan Guo, Lizhong Ding, Yuwan Yang, Xuewei Guo
Kernelized attention extends the attention mechanism by modeling sequence correlations through kernel functions, making significant progress in optimizing attention. Backed by guarantees from harmonic analysis, kernel functions can be expanded in basis functions, inspiring random feature-based approaches that improve the efficiency of kernelized attention while maintaining predictive performance. However, current random feature-based works are limited to Fourier basis expansions under Bochner's theorem. We propose Schoenberg's theorem-based attention (SchoenbAt), which approximates dot-product kernelized attention with the polynomial basis under Schoenberg's theorem via random Maclaurin features and applies a two-stage regularization to constrain the input space and restore the output scale, acting as a drop-in replacement for dot-product kernelized attention. Our theoretical proof of the unbiasedness and concentration error bound of SchoenbAt supports its efficiency and accuracy as a kernelized attention approximation, which is also empirically validated under various random feature dimensions. Evaluations on real-world datasets demonstrate that SchoenbAt significantly enhances computational speed while preserving competitive performance in terms of precision, outperforming several efficient attention methods.
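The random Maclaurin features mentioned above (Kar & Karnick, 2012) give unbiased estimates of any dot-product kernel k(x, y) = Σ_n a_n ⟨x, y⟩^n: sample a degree N, take a product of N Rademacher projections, and rescale by the Maclaurin coefficient and the sampling probability. A minimal sketch approximating exp(⟨x, y⟩), whose coefficients are a_n = 1/n! — the helper name, the 2^-(N+1) sampling measure, and the scaling are the standard textbook choices, not SchoenbAt's exact recipe:

```python
import math
import numpy as np

def random_maclaurin(X, D, coeff, seed=0):
    """Random Maclaurin features for a dot-product kernel
    k(x, y) = sum_n coeff(n) * <x, y>**n, so that
    E[Z(x) @ Z(y)] ~= k(x, y).  Illustrative helper, not SchoenbAt."""
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    Z = np.empty((n_samples, D))
    for i in range(D):
        # Sample a degree N with P(N) = 2**-(N + 1) (geometric measure).
        N = rng.geometric(0.5) - 1
        pN = 2.0 ** -(N + 1)
        # N independent Rademacher projections; since the rows of W are
        # independent, E[prod] = <x, y>**N, giving unbiasedness.
        W = rng.choice([-1.0, 1.0], size=(N, d))
        prod = np.prod(W @ X.T, axis=0) if N > 0 else np.ones(n_samples)
        Z[:, i] = math.sqrt(coeff(N) / pN) * prod
    return Z / math.sqrt(D)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3)) / 3.0   # small inputs keep the series stable
Z = random_maclaurin(X, D=20000, coeff=lambda n: 1.0 / math.factorial(n))
K_hat = Z @ Z.T                          # approximates exp(<x, y>)
K = np.exp(X @ X.T)
print(np.abs(K_hat - K).max())           # small approximation error
```

This is the mechanism that lets a polynomial-basis expansion replace the Fourier features of Bochner-style methods; SchoenbAt's two-stage regularization (constraining the input norm, then restoring the output scale) addresses exactly the variance blow-up that large ⟨x, y⟩ would cause here.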