asymmetric clustering
SMYRF - Efficient Attention using Asymmetric Clustering
We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$, where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast attention methods impose constraints (e.g.
Review for NeurIPS paper: SMYRF - Efficient Attention using Asymmetric Clustering
This paper proposes a method for reducing the quadratic bottleneck of transformer architectures to $O(N \log N)$, using an asymmetric LSH clustering strategy. The paper also shows that finding an optimal assignment is NP-hard, so heuristic approaches must be pursued. The authors propose a novel type of balanced clustering algorithm to approximate attention. The method can be applied directly to pre-trained models and achieves competitive or better performance with BigGAN, BERT, and RoBERTa while reducing memory by 50%. There was some disagreement among reviewers about this paper, with R1 and R3 recommending solid acceptance, and R2 and R4 recommending weak reject.
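To make the balanced-clustering idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The particular asymmetric transform (padding queries and keys so they share one norm while keeping their inner products), the single random projection used as the hash, and the function names are all assumptions made for illustration. The sketch also assumes the sequence length is divisible by the cluster size.

```python
import numpy as np


def asymmetric_transform(Q, K):
    """Pad queries and keys so all vectors share one norm.

    Assumed transform (not given in the text above): queries and keys get
    different padding, chosen so that F(q) . G(k) == q . k.
    """
    mq = np.linalg.norm(Q, axis=1).max()
    mk = np.linalg.norm(K, axis=1).max()
    r2 = mq ** 2 + mk ** 2
    q_pad = np.sqrt(r2 - (Q ** 2).sum(axis=1, keepdims=True))
    k_pad = np.sqrt(r2 - (K ** 2).sum(axis=1, keepdims=True))
    FQ = np.concatenate([Q, np.zeros_like(q_pad), q_pad], axis=1)
    GK = np.concatenate([K, k_pad, np.zeros_like(k_pad)], axis=1)
    return FQ, GK


def balanced_clusters(FQ, GK, cluster_size, rng):
    """Hash with one random projection, then sort and cut into equal chunks.

    Sorting costs O(N log N); equal-size chunks give balanced clusters.
    """
    a = rng.standard_normal(FQ.shape[1])
    q_order = np.argsort(FQ @ a)
    k_order = np.argsort(GK @ a)
    return (q_order.reshape(-1, cluster_size),
            k_order.reshape(-1, cluster_size))


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def dense_attention(Q, K, V):
    """Reference O(N^2) attention, for comparison only."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ V


def clustered_attention(Q, K, V, cluster_size, seed=0):
    """Approximate attention: each query attends only to keys in its cluster."""
    n, d = Q.shape
    assert K.shape[0] == n and n % cluster_size == 0
    rng = np.random.default_rng(seed)
    FQ, GK = asymmetric_transform(Q, K)
    q_idx, k_idx = balanced_clusters(FQ, GK, cluster_size, rng)
    out = np.zeros((n, V.shape[1]))
    for qi, ki in zip(q_idx, k_idx):
        weights = softmax(Q[qi] @ K[ki].T / np.sqrt(d))
        out[qi] = weights @ V[ki]
    return out
```

With `cluster_size` equal to the sequence length there is a single cluster, so the sketch reduces to dense attention. Smaller clusters trade accuracy for memory, since each cluster materializes only a `cluster_size` by `cluster_size` score matrix.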