SMYRF - Efficient Attention using Asymmetric Clustering

Neural Information Processing Systems 

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from O(N 2) to O(N \log N), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. On the contrary, prior fast attention methods impose constraints (e.g.