Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We conduct a thorough complexity analysis of Performers, a class which includes most recent linear Transformer mechanisms.
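To make the O(L) claim concrete, the following minimal sketch contrasts quadratic and linear evaluation of kernelized (non-causal) attention. It is an illustration under stated assumptions, not the Performer mechanism itself: the feature map elu(x) + 1 is a simple stand-in for the random-feature maps used by Performers, and the function names and the sizes L = 128, d = 16 are hypothetical.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map elu(x) + 1 (a stand-in, not Performer's random features).
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(q, k, v):
    # Builds the full L x L similarity matrix: O(L^2) time and memory.
    a = feature_map(q) @ feature_map(k).T
    return (a @ v) / a.sum(axis=1, keepdims=True)

def linear_attention(q, k, v):
    # Reorders the same product by associativity: phi(Q) (phi(K)^T V).
    # The d x d summary is built once, so the cost grows as O(L) in sequence length.
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v            # shape (d, d_v)
    z = kf.sum(axis=0)       # shape (d,), used for row normalization
    return (qf @ kv) / (qf @ z)[:, None]

rng = np.random.default_rng(0)
L, d = 128, 16  # hypothetical sequence length and head dimension
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions compute the same output up to floating-point error; only the order of the matrix products differs, which is what removes the L x L intermediate.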
The main question about the spiked Wigner model is: how large should the signal-to-noise ratio λ > 0 be in order to achieve constant correlation with x?
What makes a classifier able to generalize? There have been many important attempts to address this question, but a clear answer remains elusive.