kernelized attention
Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Figure 1: Validation loss changes over 50k steps.

Consider a finite sequence {X_k} of independent, random, self-adjoint matrices of dimension n. For a certain n-by-n orthogonal matrix H (HH^T is a diagonal matrix) and an n-by-d uniform sub-sampling matrix S (as defined in Definition 1 in the main paper), we denote the sketching matrix Π := √n S. We aim to show that HΠΠ^T H^T satisfies the (1/2, δ)-MA property for HH^T by the following lemma. The first inequality of the preceding display holds because H is an orthogonal matrix. It is easy to check that CC^T = B(I − P_Π)B^T.
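The uniform sub-sampling sketch above can be illustrated with a minimal numpy sketch. The construction below (each column of S selects one coordinate, with the scaling chosen so that the sketch is an identity in expectation) is a standard uniform-sampling sketch and an assumption for illustration; the paper's Definition 1 may differ in details:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 8, 4  # original dimension and sketch size

# Uniform sub-sampling matrix S: each column picks one of the n
# coordinates (standard construction; Definition 1 in the paper
# may differ, e.g. in sampling with replacement).
cols = rng.choice(n, size=d, replace=False)
S = np.zeros((n, d))
S[cols, np.arange(d)] = 1.0

# Sketching matrix Pi, scaled so the selected diagonal entries of
# Pi Pi^T equal n/d and E[Pi Pi^T] = I (illustrative assumption).
Pi = np.sqrt(n / d) * S

# For a single draw, Pi Pi^T is diagonal with entries n/d or 0,
# so its trace is always d * (n/d) = n.
print(np.diag(Pi @ Pi.T))
```

Sampling without replacement keeps each coordinate's inclusion probability at d/n, which is what makes the n/d rescaling give the correct expectation.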
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding
The attention module, a crucial component of the Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention.
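The kernelized attention this abstract builds on replaces the softmax with a non-negative feature map φ, so the n×n attention matrix never has to be formed: associativity lets (φ(Q)φ(K)^T)V be computed as φ(Q)(φ(K)^T V) in time linear in sequence length. A minimal numpy sketch, with an illustrative ReLU feature map rather than the paper's exact construction:

```python
import numpy as np

def kernelized_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # phi: a non-negative feature map (shifted ReLU here, a common
    # simple choice; the paper's specific kernel may differ).
    Qp, Kp = phi(Q), phi(K)          # (n, r) each
    KV = Kp.T @ V                    # (r, d): computed once, O(n r d)
    Z = Qp @ Kp.sum(axis=0)          # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]    # O(n r d) overall -- linear in n

rng = np.random.default_rng(0)
n, dim = 1024, 64
Q, K, V = rng.standard_normal((3, n, dim))
out = kernelized_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

The key point is that the quadratic object φ(Q)φ(K)^T is never materialized; the normalizer Z equals its row sums, so the result matches the explicit row-normalized attention exactly.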
Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Figure: Entropy loss on the validation set. We further remark that on Text Classification all models quickly fall into over-fitting, and thus the validation losses rise quickly. Results are averaged over one random batch from the test set in each LRA task. Such matrices are considered more informative since they are harder to approximate, requiring higher rank even in a truncated SVD approximation. This section introduces some useful facts, which are key to the proofs in the next section.
SchoenbAt: Rethinking Attention with Polynomial basis
Yuhan Guo, Lizhong Ding, Yuwan Yang, Xuewei Guo
Kernelized attention extends the attention mechanism by modeling sequence correlations through kernel functions, making significant progress in optimizing attention. Backed by guarantees from harmonic analysis, kernel functions can be expanded in basis functions, inspiring random feature-based approaches that improve the efficiency of kernelized attention while maintaining predictive performance. However, current random feature-based works are limited to Fourier basis expansions under Bochner's theorem. We propose Schoenberg's theorem-based attention (SchoenbAt), which approximates dot-product kernelized attention with the polynomial basis under Schoenberg's theorem via random Maclaurin features and applies a two-stage regularization to constrain the input space and restore the output scale, acting as a drop-in replacement for dot-product kernelized attention. Our theoretical proof of the unbiasedness and concentration error bound of SchoenbAt supports its efficiency and accuracy as a kernelized attention approximation, which is also empirically validated under various random feature dimensions. Evaluations on real-world datasets demonstrate that SchoenbAt significantly enhances computational speed while preserving competitive performance in terms of precision, outperforming several efficient attention methods.
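The random Maclaurin features mentioned above (Kar & Karnick, 2012) give unbiased estimates of any dot-product kernel k(x, y) = Σ_n a_n ⟨x, y⟩^n: sample a degree N, take a product of N Rademacher projections, and rescale by the Maclaurin coefficient and the sampling probability. A minimal sketch approximating exp(⟨x, y⟩), whose coefficients are a_n = 1/n! — the helper name, the 2^-(N+1) sampling measure, and the scaling are the standard textbook choices, not SchoenbAt's exact recipe:

```python
import math
import numpy as np

def random_maclaurin(X, D, coeff, seed=0):
    """Random Maclaurin features for a dot-product kernel
    k(x, y) = sum_n coeff(n) * <x, y>**n, so that
    E[Z(x) @ Z(y)] ~= k(x, y).  Illustrative helper, not SchoenbAt."""
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    Z = np.empty((n_samples, D))
    for i in range(D):
        # Sample a degree N with P(N) = 2**-(N + 1) (geometric measure).
        N = rng.geometric(0.5) - 1
        pN = 2.0 ** -(N + 1)
        # N independent Rademacher projections; since the rows of W are
        # independent, E[prod] = <x, y>**N, giving unbiasedness.
        W = rng.choice([-1.0, 1.0], size=(N, d))
        prod = np.prod(W @ X.T, axis=0) if N > 0 else np.ones(n_samples)
        Z[:, i] = math.sqrt(coeff(N) / pN) * prod
    return Z / math.sqrt(D)

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3)) / 3.0   # small inputs keep the series stable
Z = random_maclaurin(X, D=20000, coeff=lambda n: 1.0 / math.factorial(n))
K_hat = Z @ Z.T                          # approximates exp(<x, y>)
K = np.exp(X @ X.T)
print(np.abs(K_hat - K).max())           # small approximation error
```

This is the mechanism that lets a polynomial-basis expansion replace the Fourier features of Bochner-style methods; SchoenbAt's two-stage regularization (constraining the input norm, then restoring the output scale) addresses exactly the variance blow-up that large ⟨x, y⟩ would cause here.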