Supplementary for SOFT: Softmax-free Transformer with Linear Complexity

Neural Information Processing Systems 

According to the eigenfunction's definition, we can get: $\int k(y, x)\,\phi \ldots$

Li Zhang (lizhangfd@fudan.edu.cn) is the corresponding author with School of Data Science, Fudan

In our formulation, we approximate the Gaussian kernel weights instead of calculating them directly. More specifically, the relation between any two tokens is reconstructed via sampled bottleneck tokens.

However, it turns out to suffer from a similar failure.

For each model, we show the output from the first two attention heads (top and bottom rows).
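
The reconstruction of token relations via sampled bottleneck tokens, described above, can be illustrated with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Nyström-style low-rank form, a unit-bandwidth Gaussian kernel, uniformly random sampling of the bottleneck tokens, and an exact pseudo-inverse from `np.linalg.pinv`. The names `gaussian_kernel` and `bottleneck_attention` are hypothetical and do not appear in the source.

```python
import numpy as np


def gaussian_kernel(a, b):
    """Pairwise Gaussian kernel exp(-||a_i - b_j||^2 / 2) between rows of a and b."""
    sq_dist = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def bottleneck_attention(tokens, bottleneck_idx):
    """Approximate the n x n kernel matrix through m sampled bottleneck tokens.

    Assumption: Nystrom-style reconstruction K ~ K_nm pinv(K_mm) K_nm^T.
    """
    bottleneck = tokens[bottleneck_idx]             # (m, d) sampled tokens
    k_nm = gaussian_kernel(tokens, bottleneck)      # (n, m) token-to-bottleneck weights
    k_mm = gaussian_kernel(bottleneck, bottleneck)  # (m, m) bottleneck-to-bottleneck weights
    return k_nm @ np.linalg.pinv(k_mm) @ k_nm.T     # (n, n) reconstructed weights


rng = np.random.default_rng(0)
tokens = 0.5 * rng.standard_normal((64, 8))         # 64 toy tokens of dimension 8
idx = rng.choice(64, size=16, replace=False)        # 16 randomly sampled bottleneck tokens

approx = bottleneck_attention(tokens, idx)
exact = gaussian_kernel(tokens, tokens)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative Frobenius error with 16 of 64 tokens: {rel_err:.3f}")
```

In this sketch only the (n, m) and (m, m) kernel blocks are evaluated directly. The full n x n matrix is formed here only to measure the approximation error; in practice the three factors would be applied to the values one after another, so the cost stays linear in the number of tokens n for a fixed number of bottleneck tokens m.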