Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Neural Information Processing Systems
cross-entropy loss on the validation set. We further remark that on Text Classification, all models quickly overfit, and thus the validation losses rise quickly. Results are averaged over one random batch from the test set in each LRA task.

Such matrices are considered more informative since they are harder to approximate, requiring more ranks even in a truncated SVD approximation.

This section introduces some useful facts that are key to the proof in the next section.
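The rank criterion above can be made concrete. The following sketch (not from the paper; matrix sizes and the 95%-energy threshold are illustrative assumptions) measures how many singular values a truncated SVD needs to retain most of a matrix's squared Frobenius norm, comparing a near-low-rank matrix with a full-rank Gaussian one:

```python
import numpy as np

def rank_for_energy(A, energy=0.95):
    """Smallest truncation rank r such that the top-r singular values
    capture `energy` of the squared Frobenius norm of A."""
    s = np.linalg.svd(A, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
# Product of two thin factors: rank at most 8, so few singular
# values carry almost all of the energy.
low_rank = rng.standard_normal((128, 8)) @ rng.standard_normal((8, 128))
# Full-rank Gaussian matrix: singular values decay slowly, so a
# truncated SVD needs many more ranks for the same accuracy.
full_rank = rng.standard_normal((128, 128))

print(rank_for_energy(low_rank), rank_for_energy(full_rank))
```

In this sense a matrix that needs a larger rank for the same approximation energy is "harder to approximate", which is the informativeness criterion used in the text above.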