Appendix: Remodel Self-Attention with Gaussian Kernel and Nyström Method


Entropy loss on the validation set. We further remark that on Text Classification, all models overfit quickly, and thus the validation losses rise quickly. Results are averaged over one random batch from the test set in each LRA task.

Such matrices are considered more informative since they are harder to approximate, requiring a higher rank even in a truncated SVD approximation (see the sketch below).

This section introduces some useful facts that are key to the proof in the next section.
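To make the truncated-SVD remark above concrete, here is a minimal sketch, not the paper's code: the matrix sizes, the 5% relative Frobenius-error threshold, and the helper name `rank_needed` are illustrative assumptions. It computes the smallest rank at which a truncated SVD reaches a target error, so a matrix whose spectrum decays slowly ("more informative", harder to approximate) needs many more singular directions than a genuinely low-rank one.

```python
# Illustrative sketch (assumed setup, not from the paper): smallest rank k for
# which the best rank-k approximation (truncated SVD) reaches a target
# relative Frobenius error.
import numpy as np

def rank_needed(A: np.ndarray, rel_err: float = 0.05) -> int:
    """Smallest k such that the rank-k truncated SVD of A has
    relative Frobenius error at most `rel_err`."""
    s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
    total = np.sum(s**2)                          # ||A||_F^2
    tail = total - np.cumsum(s**2)                # squared error after keeping k values
    ok = np.nonzero(tail <= (rel_err**2) * total)[0]
    return int(ok[0]) + 1 if ok.size else len(s)

rng = np.random.default_rng(0)
n, r = 512, 8
low_rank = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))  # exactly rank 8
full_rank = rng.normal(size=(n, n))                           # slowly decaying spectrum

print(rank_needed(low_rank))   # small (at most 8): easy to approximate
print(rank_needed(full_rank))  # close to n: needs many more singular directions
```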
