Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method

Apr-24-2026, 18:12:25 GMT–Neural Information Processing Systems

Figure 1 shows how the validation loss (cross-entropy, y-axis) changes over training time for 50k steps, as supplementary results for the experiments in Section 5. Across all tasks, Skyformer generally converges faster and completes the 50k steps sooner than vanilla Attention and Kernelized Attention. On Text Classification, all models overfit quickly, so their validation losses soon begin to rise. On Pathfinder, which is difficult to train, vanilla Attention fails to reach the best long-time limit under a certain setting in the trial shown in the figure. Figure 2 shows the singular value distribution of the attention output from the second layer of a trained vanilla transformer.
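A singular value distribution like the one described for Figure 2 could be computed roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' code: the inputs are random matrices standing in for a trained model's queries, keys, and values, the sizes `n` and `d` are hypothetical, and the helper names `softmax_attention` and `singular_value_spectrum` are introduced here only for illustration.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Single-head scaled dot-product (vanilla) attention."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row-wise maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V


def singular_value_spectrum(M):
    """Singular values of M in descending order, scaled by the largest."""
    s = np.linalg.svd(M, compute_uv=False)
    return s / s[0]


# Assumption: random stand-ins for the activations of a trained layer.
rng = np.random.default_rng(0)
n, d = 128, 32  # hypothetical sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out = softmax_attention(Q, K, V)
spectrum = singular_value_spectrum(out)
```

Plotting `spectrum` against its index gives the kind of curve such a figure would show. With activations from a real trained layer, the rate of decay indicates how close the attention output is to low rank.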



